Shining a New Light on the World’s Collections

From Vince Smith (NHM), Deborah Paul (iDigBio), Matt Woodburn (NHM), Sharon Grant (FM), Randy Singer (iDigBio), Kevin Love (iDigBio)
Introduction.
Picture a time in the future when we can look online at any time to see, planet-wide, how many collections are digitized, georeferenced, and published, and how many have yet to be digitized. Perhaps you want to know what is unique about your collection (digitized or not) compared to others worldwide. Or maybe you seek taxonomic expertise. Multiple efforts underway across the world seek to build dynamic visualization tools that help collections grasp the range of their holdings. These approaches vary in scope and method, but all aim to help the museum and collections world collectively quantify and qualify the specimens it holds. These efforts go beyond a list or catalogue such as GRBio.org in that they offer ways to create graphic data on demand. At the moment, only some of them are at least partly automated, drawing on published collections metadata. Most of them require humans to input the necessary data.
Known efforts currently underway include:

One World Collection (OWC): The OWC project seeks to characterize global natural history collections and their related staffing and research expertise at a high level.  As of August 2018, 76 institutions with collections estimated to include 1.2 billion natural history specimens are hard at work gathering data they plan to use as the basis for a research paper targeted for publication in 2019.  More information about this project will be shared in the coming months.

CETAF Passports: The CETAF (Consortium of European Taxonomic Facilities) network comprises 33 members representing 59 of the largest taxonomic institutions from 21 European countries. Between them, the CETAF membership’s collections comprise an estimated 1.5 billion specimens and represent more than 80% of the world’s described species.  Each member is requested to contribute information to a CETAF Passport, and outreach is carried out annually to encourage members to manually update their information.

The collections data elements are fairly concise and high-level, but there is also a wealth of additional information for each member about the institution, facilities, staff and budgets. With the exception of financial information, CETAF Passports data are freely available on the CETAF website, either through the search page or by browsing institutional profiles.

Join the Dots / Move the Dots: A detailed collections assessment methodology pioneered by the Smithsonian and further developed by the NHM, in which the collection is divided into distinct units (about 3000 for NHM) according to various criteria such as preservation method, taxonomy and stratigraphy. In addition to size counts or estimates, each unit is scored 1-5 against a range of criteria relating to condition, importance and significance, documentation, digitization and outreach. Metrics calculated from these assessments can be used to inform planning and assess impact of activities such as collections moves and digitization projects.

The original ‘Move the Dots’ assessment was carried out by the Smithsonian in 2009 and has been updated annually since then, with data contributing to the Smithsonian’s public National Collections dashboard. This ongoing assessment enables monitoring of how these indicators change over time (hence the name ‘Move the Dots’, referring to the scores plotted on a series of dot charts that show how subsets of the collection ‘move’ over different time periods).  The NHM’s ‘Join the Dots’ version was first piloted in 2016, and is now in its second year of data collection. Work continues on developing technical infrastructure and automation to support the project as a business-as-usual activity, including dynamic dashboards for internal use and public release.

The initial assessment can be quite resource-intensive, and the Smithsonian and NHM have both invested heavily in the process. How useful the results are across institutions depends heavily on which collection units are scored, since many units are specific to one institution. At least for NHM, however, many of the units can now be aggregated by taxonomy, object type and storage location.
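To make the idea of aggregating unit scores concrete, here is a minimal sketch in Python. It is not the Smithsonian's or NHM's actual method: the unit records, the criterion names, and the choice to weight each score by the unit's estimated size are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical collection units: (taxonomic group, estimated item count,
# scores from 1 to 5 per criterion). All values are invented for illustration.
units = [
    ("Insecta", 12000, {"condition": 4, "digitization": 2}),
    ("Insecta", 3000, {"condition": 2, "digitization": 1}),
    ("Aves", 5000, {"condition": 5, "digitization": 4}),
]


def weighted_scores(units, criterion):
    """Return the size-weighted mean score for one criterion, per group.

    Weighting by unit size is an assumption of this sketch, chosen so that
    large units count for more than small ones.
    """
    totals = defaultdict(lambda: [0, 0])  # group -> [sum(score * size), sum(size)]
    for group, size, scores in units:
        totals[group][0] += scores[criterion] * size
        totals[group][1] += size
    return {group: weighted / size for group, (weighted, size) in totals.items()}


condition_by_group = weighted_scores(units, "condition")
digitization_by_group = weighted_scores(units, "digitization")
```

Running the same aggregation on each year's scores would give the kind of per-group values that could be plotted to show how the dots move over time.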

CSAT: Collections Self Assessment Tool. This was developed by SYNTHESYS as a quick mechanism for institutions to self-assess their collections as part of admission to the SYNTHESYS consortium. The self-assessments are periodically "audited" by other SYNTHESYS partners, and the outputs are intended to give collections users a better understanding of the collections and facilities available across partner institutions.

A National Resource. At iDigBio, we built a Collections Catalog intended to be a comprehensive list of all natural history collections in the United States of America. With this resource, you can search by the usual items (institution name, person, etc.). You can see a map of all collections we know about across the US and update information where needed. Another unique feature of this software makes it possible to link the published specimen records to the collection information entry. This facilitates discovery of which collections have published specimen-level data. If you know of institutions or collections that are not shown in the list, please complete this form. If you have any questions or would like to know more about the API, for example, please email Kevin Love (iDigBio) at klove AT flmnh DOT ufl DOT edu

We are currently using this aggregated data to identify collections at minority-serving institutions (MSI), with the intent to strategically focus on supporting their data mobilization efforts. Additionally, we plan to document any MSI data-mobilization roadblocks so that we can address them with training materials or inform those who can. Information in this catalog also allows us to successfully target our survey and outreach efforts (as measured by response rates). For example, we can better understand how many collections (and what type) have not yet been able to create and share digital biocollections data. Now, and with future developments, this data means we can more effectively evaluate and communicate ADBC program progress toward facilitating research, expanding and enhancing stakeholder inclusion, and embedding digitization in collections as a standard of practice.

 

Index Herbariorum (IH). This resource is the one place to look to discover the world's herbaria. New software makes it easy for herbaria to sign up themselves and update their entries, simplifying the process of keeping the data current. The botany world began organizing and collecting this information over 80 years ago, recognizing the benefits to be had by compiling this data. From their website: "as of December 1 2017, there are 3,001 active herbaria in the world today, with approximately 12,174 associated curators and biodiversity specialists. Collectively the world's herbaria contain an estimated 387,007,790 specimens that document the earth's vegetation for the past 400 years. Index Herbariorum is a guide to this crucial resource for biodiversity science and conservation."

 

Field Museum. The FMNH prototype system is an attempt to uncover data in an institution's archives that hasn’t been captured at the Occurrence level.  The prototype’s development focuses on function, not form, so design-wise it is very plain and the data remain largely unfiltered and untransformed.

The tool has two overarching purposes.  First, it investigates how best to provide any user with the ability to retrieve statistics about collections (digitized or not) that are relevant to them.  Its second purpose is to test the application of existing data standards to estimate and compare numbers of digitized and undigitized items (backlog).  Visualizations of descriptive statistics are merely a few examples of what collections staff felt were useful at the time. This resource is public; the datasets will change over time.

 

A different approach, a different angle.

FishFindR.net. With this sample tool, the developers take a different tack by focusing only on the collections specimen data that are published and making it possible to compare one published collection with another. You can check out the website to explore data from institutional fish collections integrated in iDigBio. Using iDigBio’s specimen data API, these data can be retrieved without needing to interview individual collections staff about the details of their holdings. Instead, the published data can be queried directly, and collections staff can then be contacted to validate the numbers and help resolve data discrepancies. The data can be used in near-real time to give the most accurate snapshot of current collection holdings as they are published. This prototype website provides a framework in which all digitized collections can be analyzed and compared using data from iDigBio's APIs. (See CD Use Cases for more information about this tool, and contact Randy Singer (Graduate Student) and Kevin Love (Biodiversity Informatics Manager), both at iDigBio.)
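As an illustration of comparing one published collection with another, the following Python sketch works on per-family record counts that have already been retrieved. It is not FishFindR.net's code and makes no API calls: the collection names, family counts, and the compare_collections helper are hypothetical.

```python
# Hypothetical published record counts per fish family for two collections.
# In practice such counts would come from published specimen data; these
# numbers are invented for illustration.
collection_a = {"Cyprinidae": 1200, "Percidae": 300, "Salmonidae": 150}
collection_b = {"Cyprinidae": 800, "Cichlidae": 450, "Salmonidae": 20}


def compare_collections(a, b):
    """Split the families into those shared and those held by only one side."""
    families_a, families_b = set(a), set(b)
    return {
        "shared": sorted(families_a & families_b),
        "only_a": sorted(families_a - families_b),
        "only_b": sorted(families_b - families_a),
    }


comparison = compare_collections(collection_a, collection_b)

# For shared families, the difference in record counts hints at where one
# collection's published holdings are stronger than the other's.
count_difference = {
    family: collection_a[family] - collection_b[family]
    for family in comparison["shared"]
}
```

Families that appear on only one side are candidates for what is unique about a collection, subject to validation with collections staff, since unpublished holdings are invisible to this kind of comparison.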

Key points.

The examples above differ widely in audience, features, and formats. Note that in the first four cases, while all differ to a degree in terms of scope, focus and participation, there is still a considerable amount of conceptual overlap, with some institutions (e.g. NHM) involved in many of them. All still have room for refinement in places, although most are either planning or actively undertaking that work. At least for the first four, creating the platforms and updating the information has taken more resources (time and staff) than originally envisaged.  Some of these tools, such as One World Collection and CETAF Passports, focus on the need to share collection-level data for a particular group, region (national, international) or program. Efforts such as Join / Move the Dots require a considerable investment of time and effort but have great potential for providing a deeper view into undigitized collections, especially if there were a common mechanism to complete and compare assessments.

 

Toward Automation. The various initiatives do not yet share data structures or make extensive use of shared data standards (where they exist). As a result, while very useful internally, the value often remains mostly unseen outside the organisations concerned. To share this data across these and other platforms (like Index Herbariorum, the iDigBio Collections Catalog, and GRBio.org) we need a standard, like Darwin Core. A group at Biodiversity Information Standards (TDWG) is working on a Collection Descriptions standard now, for this very purpose. With the exception of FishFindR.net, none of these efforts use much automation yet, and so rely heavily on manual data entry and curation. The FishFindR.net tool is almost fully automated, but intentionally focuses only on published specimen data. It doesn't include collections that have yet to publish any specimen-level data.

To create a more automated resource we need denominators. Most collections do periodic inventories at some level (drawer, species, lot count etc.). For all collections, published or not, these estimates of physical holdings offer the denominators needed to automate estimates of digitization progress and physical collections care needs. With this information, people can better plan for filling gaps (taxonomic, geographic, curatorial, etc.), seeking funding, and meeting researcher, conservation, outreach, and policy needs. By combining collection-level with specimen-level data in this fashion, we can more quickly and easily determine what’s left to be done. The Field Museum, Join the Dots, and CETAF Passport initiatives give us ideas for what we will be able to do. Imagine getting those monthly/yearly status reports done in a click. And next time someone wants to do a collections digitization status survey, you can point them to these resources instead!
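The role of the denominator can be shown with a small worked example in Python. This is a sketch under stated assumptions: the function name digitization_progress and the example numbers are hypothetical, and the inventory estimate is treated as the denominator while the count of published specimen records is the numerator.

```python
def digitization_progress(estimated_holdings, published_records):
    """Return (fraction digitized, remaining backlog) for one collection.

    estimated_holdings is the inventory-based estimate of physical items
    (the denominator); published_records is the number of specimen records
    already published (the numerator).
    """
    if estimated_holdings <= 0:
        raise ValueError("estimated_holdings must be positive")
    # Inventory estimates can lag behind publishing, so cap the fraction
    # at 1.0 and the backlog at 0 rather than report impossible values.
    fraction = min(published_records / estimated_holdings, 1.0)
    backlog = max(estimated_holdings - published_records, 0)
    return fraction, backlog


# Invented example: 200,000 items estimated, 50,000 records published.
fraction, backlog = digitization_progress(200_000, 50_000)
```

Without the inventory estimate only the numerator is known, which is why collections with no published specimen-level data cannot be placed on the same scale by published data alone.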

Others plan to do further work on integrating the vision, scope, and functionality of the first four tools. ICEDIG plans to further define what a first version collections description dashboard should do and work continues on this now that DiSSCo is funded. SYNTHESYS+, newly funded, takes on the integration task as a 4-year activity.

GBIF is taking over GRBio.org and GRSciColl and plans to build a resource that will coalesce all of this worldwide collections-level metadata. The authors look forward to the time when we are able to join the information about the published and not-yet-published collections to get the most accurate worldwide view of collections status. For this, we will need humans and technology. SYNTHESYS+ will play a key part in making this a reality.

 

References.

CETAF (Consortium of European Taxonomic Facilities) (2018) https://www.cetaf.org/about-us/what-cetaf Accessed 1 Sept 2018.

DiSSCo - Distributed System of Scientific Collections (2018) http://www.dissco.eu/content/about-us Accessed Sept 2018.

Field Museum Prototype System (2018) https://collections-dashboard.fieldmuseum.org/ Accessed Sept 2018.

FishFindR.net (2018). http://fishfindr.net/ Accessed Sept 2018.

Smithsonian National Collections Dashboard (2018) https://www.si.edu/dashboard/national-collections Accessed Sept 2018.

Synthesys+ Synthesis of Systematic Resources (2018) http://www.dissco.eu/content/our-projects Accessed Sept 2018. [Note: the direct site for this project was being updated at the time of writing.]

Index Herbariorum: A global directory of public herbaria and associated staff. New York Botanical Garden's Virtual Herbarium. http://sweetgum.nybg.org/science/ih/. Thiers, B. [continuously updated]. Accessed Sept 2018.

US Collections Catalog at iDigBio (2018) https://www.idigbio.org/portal/collections Accessed Sept 2018.