Playing with biological specimen data in iDigBio – limitations and solutions for research
Puerto Rico: warm Caribbean seas, high biodiversity, and coqui frogs. iDigBio was invited to NatureServe’s Biodiversity without Boundaries 2016 meeting in April 2016 to share ideas and resources with members of the conservation community. As the lucky speaker given the task of showcasing the Advancing Digitization of Biological Collections program, iDigBio and its community, and the biological specimen data aggregated and shared by iDigBio, I decided to highlight the records and media available for Puerto Rico. We often hear within the research community about the issues associated with the biological data delivered by iDigBio and other aggregators, and preparing this talk let me meet several of them as a data user.
My first discovery was that the higher geography of the region is classified in more than one way. Within the data served by iDigBio, Puerto Rico is treated both as a country (Commonwealth of Puerto Rico) and as a territory of the USA. Hmmm. Using the iDigBio portal, capturing all of the data, about 160,000 records, required two separate downloads. One solution is to use the iDigBio API to request the data, formulating a search query for a bulk download. iDigBio is also working to provide another indexed field, the ISO country code, which may help resolve the issue.
The second discovery concerned geopoints. When I used the rectangle polygon tool in the iDigBio data portal to capture the region of Puerto Rico, only about 70,000 of those roughly 160,000 records were left after the filter. What did this mean? Fewer than half of the records within iDigBio for the island of Puerto Rico and the surrounding smaller islands had a geopoint. Immediately, as a data user, I was losing more than half the data for analysis. The solution is to georeference the specimens using a tool like GeoLocate. At present this falls to the data providers, the collections housing the biological resources, who georeference the locality information and submit it to iDigBio. Researchers can help by repatriating georeferenced data to the original collections.
Information about georeferencing can be found on the Georeferencing Wiki, and the Georeferencing Short Course for Research Use, taking place October 4-7, 2016, is an upcoming training opportunity.
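For readers who want to try the API route, here is a minimal sketch of how the two geography variants might be combined and how the share of records with usable geopoints could be measured. It rests on several assumptions that should be checked against the current iDigBio API documentation: the search endpoint URL, the `rq` query parameter, the indexed field names `country` and `stateprovince`, and the `uuid` and `indexTerms.geopoint` keys in each returned record. The bounding box coordinates are rough, hypothetical values rather than the rectangle drawn in the portal.

```python
import json
from urllib.parse import urlencode

# Assumed endpoint of the iDigBio v2 record search API.
SEARCH_URL = "https://search.idigbio.org/v2/search/records"

# Hypothetical bounding box roughly enclosing Puerto Rico and nearby islands.
NORTH, SOUTH, WEST, EAST = 18.6, 17.8, -68.0, -65.2


def build_queries():
    """One query per way Puerto Rico is classified (field names are assumptions)."""
    return [
        {"country": "puerto rico"},
        {"country": "united states", "stateprovince": "puerto rico"},
    ]


def search_url(rq, limit=100, offset=0):
    """Encode a record query as a GET URL for the assumed search endpoint."""
    params = {"rq": json.dumps(rq), "limit": limit, "offset": offset}
    return SEARCH_URL + "?" + urlencode(params)


def merge_by_uuid(*result_sets):
    """Combine several lists of records, keeping one copy of each uuid."""
    merged = {}
    for records in result_sets:
        for rec in records:
            merged.setdefault(rec["uuid"], rec)
    return list(merged.values())


def in_box(rec):
    """True if the record has a geopoint that falls inside the bounding box."""
    point = rec.get("indexTerms", {}).get("geopoint")
    if not point:
        return False
    return SOUTH <= point["lat"] <= NORTH and WEST <= point["lon"] <= EAST


def fraction_in_box(records):
    """Share of records that would survive a rectangle filter."""
    if not records:
        return 0.0
    return sum(1 for rec in records if in_box(rec)) / len(records)
```

Each URL from `search_url` would be fetched with an HTTP client of your choice and paged through with `limit` and `offset`; the lists of returned records are then passed to `merge_by_uuid` before computing `fraction_in_box`. Deduplicating on the record identifier guards against a specimen appearing under both classifications.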
Primary biodiversity data for Puerto Rico shared through iDigBio: plants (green), animals (purple), fungi (orange), and others (yellow).
Curious to know which broad taxonomic groups were represented in the data, I turned next to the field dwc:kingdom, and it was the next challenge. Data cleaning was still needed, as the iDigBio indexing algorithms were only able to provide a kingdom when the field was not blank. Some of the missing values could be retrieved from the fields dwc:higherClassification and dwc:class, but again, having the data providers supply this information is the best solution. The results for Puerto Rico were interesting: herbarium specimens made up the largest proportion of specimen records, and the majority of these came from a single institution, New York Botanical Garden. The representation of plant, animal, and fungal primary biological data in iDigBio for the region reflects particular digitization projects focusing on the Caribbean and is indicative of the focus of the various Thematic Collections Networks (TCNs). Collections within Puerto Rico are, as yet, unrepresented, and iDigBio is communicating with the collections community to see how their regionally focused collections can become more visible and accessible for research and curation.
So, what can be done about the lack of geopoints, the issues with taxonomy, and the issues with indexing geography? What are some practical solutions? The iDigBio cyberinfrastructure team at ACIS is working on some indexing solutions, but the biggest help is for the iDigBio community data providers to use the Data Quality Flags to assist with cleaning data before resubmission to iDigBio, and for researchers to work collaboratively with the collections community to help improve the data. Contributing to one of the many iDigBio working groups is another way to help resolve these data issues. Working as a community, utilizing data standards, and sharing biodiversity data from even the smallest collections helps to build a more robust data set for research, collections curation, and other downstream uses.
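As an illustration of the kingdom clean-up described above, the following sketch fills a blank dwc:kingdom from dwc:higherClassification and then from dwc:class. It is only one possible approach, not the method used by iDigBio's indexing. The delimiters assumed for dwc:higherClassification, the list of kingdom names, and the small class-to-kingdom lookup are all hypothetical and would need to be extended for real data.

```python
import re
from collections import Counter

# Kingdom names to look for inside dwc:higherClassification (assumed list).
KNOWN_KINGDOMS = {
    "plantae", "animalia", "fungi", "chromista",
    "protozoa", "bacteria", "archaea",
}

# Hypothetical, deliberately tiny lookup from class to kingdom.
CLASS_TO_KINGDOM = {
    "magnoliopsida": "plantae",
    "liliopsida": "plantae",
    "insecta": "animalia",
    "aves": "animalia",
    "amphibia": "animalia",
    "agaricomycetes": "fungi",
}


def infer_kingdom(record):
    """Return a lower-case kingdom for a record, or None if it cannot be inferred."""
    # 1. Use dwc:kingdom when the provider supplied it.
    kingdom = (record.get("dwc:kingdom") or "").strip().lower()
    if kingdom:
        return kingdom
    # 2. Otherwise look for a kingdom name in dwc:higherClassification,
    #    assuming its terms are separated by |, ;, , or :.
    higher = record.get("dwc:higherClassification") or ""
    for part in re.split(r"[|;,:]", higher):
        name = part.strip().lower()
        if name in KNOWN_KINGDOMS:
            return name
    # 3. Finally fall back to the class lookup.
    cls = (record.get("dwc:class") or "").strip().lower()
    return CLASS_TO_KINGDOM.get(cls)


def kingdom_counts(records):
    """Tally records per inferred kingdom; unresolved records count as 'unknown'."""
    return Counter(infer_kingdom(rec) or "unknown" for rec in records)
```

Records that remain "unknown" after these three steps are the ones that still need attention from the data provider, which is why supplying dwc:kingdom at the source remains the better fix.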
Proportion of data from Puerto Rico within iDigBio by collection recordset size (left) and by kingdom (right).
-- Contributed by Shelley A James