Compelling Content Inspires Unconference Participants at Digital Data II

Mon, 2018-06-25 14:25 -- dpaul

After two days of soaking up talks and plenary keynotes focused on the theme of Emerging Innovations for Biodiversity Data, 20 participants gathered together to share what ideas inspired them.

The backgrounds of our group varied widely: microbiology, marine science, ecology, bryology, geology, paleontology, collection management, data science, engineering, computer science, botany, and more! Everyone contributed one or more topics of interest, and then we invited everyone to stand in groups next to the ideas that interested them the most. Four topics emerged to form the basis for our groups: Skills and Training, Taxon Concepts, Team Meta, and Data Integration. We then set out to gather our inspirations and look for next steps.

In the Skills and Training group, we all see the need to 1) look at what is missing in current training materials, 2) develop a Carpentries lesson specific to natural sciences collections data, and 3) keep data applications (like the ones showcased at Digital Data II) in mind to engage a broader community. Read our team notes for the details and find out what we see as missing or in need of further development in an overall curriculum for those creating and using collections data.

Taxon Concepts are a (huge!) challenge to current data sharing and aggregation methods. While new specimens and observations may contain robust identifications, legacy data often lacks vital taxon concept information. The Taxon Concept group noted that Cam Webb gave a good talk at Digital Data II on developing schema to relate concepts. The group suggests that even with such mappings, we still need to decide which specimens should be assigned to which concept. Ideas that have been raised to approach the latter problem include:

  • Date the identification was made (in hopes that there was a uniformly accepted concept at that time -- certainly not always true).
  • An assessment of the diagnostic characters of the specimen (assuming the taxonomist presented such, and that they can be seen in a specimen image).
  • Using the specimens known to have been identified by the author of the concept, via annotation history, as a “gold standard” representing her/his taxon concept. This could possibly be combined with the preceding approach by using image processing to find other specimens that match the gold standard. At minimum, users wanting only correctly identified specimens (according to a particular concept) could restrict themselves to the gold standard specimens, perhaps expanding the set of specimens by including those identified by experts who are trusted to be using a particular concept (the “silver standard”).
  • Using geographic location in the many cases where taxon splits have resulted in allopatric distributions of segregates (e.g., west and east of the Sierra).
  • The PhyloCode, to be published soon (https://www.ohio.edu/phylocode/), could solve a number of these issues going forward, but many of the legacy problems will remain.
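
To make the gold/silver standard idea a little more concrete, here is a minimal sketch in Python. It is only an illustration of the filtering step, not something the group built: the `Specimen` record, its field names, and the people named in the example are all hypothetical, and a real system would draw the identifier from a specimen's annotation history.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Specimen:
    # Hypothetical minimal record; real collection records carry far more.
    catalog_number: str
    identified_by: str


def classify(specimen, concept_author, trusted_experts):
    """Label a specimen 'gold', 'silver', or 'unverified' for one taxon concept."""
    if specimen.identified_by == concept_author:
        return "gold"  # identified by the author of the concept
    if specimen.identified_by in trusted_experts:
        return "silver"  # identified by an expert trusted to use the concept
    return "unverified"


def select(specimens, concept_author, trusted_experts=frozenset(), include_silver=False):
    """Return gold-standard specimens, optionally widened to the silver standard."""
    accepted = {"gold", "silver"} if include_silver else {"gold"}
    return [
        s for s in specimens
        if classify(s, concept_author, trusted_experts) in accepted
    ]


# Made-up example data.
specimens = [
    Specimen("HERB-001", "A. Author"),
    Specimen("HERB-002", "B. Expert"),
    Specimen("HERB-003", "C. Volunteer"),
]
gold_only = select(specimens, "A. Author")
gold_and_silver = select(specimens, "A. Author", {"B. Expert"}, include_silver=True)
```

Matching on a name string is the weakest part of this sketch; persistent identifiers for people would make the comparison far more reliable.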

Team Metadata discussed the challenges associated with accessing and using scientific metadata. Currently, most repositories use machine-readable metadata formats such as Ecological Metadata Language (EML), which are unfamiliar to most scientists. Thus, we hope to enable conversions from current metadata structures to lightweight tabular formats, as well as from tabular data (back) to EML. Much of the groundwork has already been laid by the dataspice group from the rOpenSci unconf (https://github.com/ropenscilabs/dataspice). We propose to:

  • Revise the currently proposed dataspice tables to be more comprehensive
    • Add identifiers to access.csv
    • Add missing values to attributes.csv
    • Further explore how to standardize unit information provided in the unitText field
    • Add roles to creators.csv (possibly "author" and "contributor")
    • Require (?) ORCIDs
  • Develop a validation method for checking tabular metadata against some standard
  • Convert between tables and EML
  • Explore other formats for storing metadata (markdown?)
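
As a rough idea of what a validation step for tabular metadata might look like, here is a small Python sketch that checks a creators table. The column names (`name`, `orcid`, `role`) and the two allowed roles are assumptions made for illustration; they are not the actual dataspice column layout. The ORCID check uses the ISO 7064 MOD 11-2 check digit that ORCID documents for its identifiers.

```python
import csv
import io
import re

# Assumed layout for illustration only; not the real dataspice creators.csv schema.
REQUIRED_COLUMNS = ("name", "orcid", "role")
ALLOWED_ROLES = {"author", "contributor"}
ORCID_PATTERN = re.compile(r"^\d{4}-\d{4}-\d{4}-\d{3}[\dX]$")


def orcid_check_digit(base_digits):
    """Compute the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid):
    """Check both the 0000-0000-0000-0000 shape and the final check character."""
    if not ORCID_PATTERN.match(orcid):
        return False
    digits = orcid.replace("-", "")
    return orcid_check_digit(digits[:15]) == digits[15]


def validate_creators(csv_text):
    """Return a list of human-readable problems; an empty list means no problems found."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        return [f"missing column: {c}" for c in missing]
    problems = []
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        if not (row["name"] or "").strip():
            problems.append(f"line {line_no}: name is empty")
        if not is_valid_orcid((row["orcid"] or "").strip()):
            problems.append(f"line {line_no}: invalid ORCID")
        if (row["role"] or "").strip() not in ALLOWED_ROLES:
            problems.append(f"line {line_no}: unknown role")
    return problems


sample = "name,orcid,role\nJosiah Carberry,0000-0002-1825-0097,author\n"
problems = validate_creators(sample)
```

Returning a list of problems, rather than stopping at the first one, lets a data provider fix a whole table in one pass.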

Data Integration Group
We talked about the issues of linking heterogeneous data for analyses. One example given was combining citizen science data with census data to find patterns in participant behavior/demographics. Often the researcher/analyst must create new linkages between datasets that are not explicitly contained in the data.  

One commonly occurring linkage we discussed was geospatial location. Researchers might know from their experience that datasets with implicit spatial information can be viewed, combined and processed in a GIS. It would be better if the geospatial data were made more explicit (e.g. as a separate table column containing latitude/longitude coordinates or bounding box information) to facilitate data usability.
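
A small Python sketch of what "explicit" could mean here: derive a bounding box from a dataset's point coordinates so it can be stored alongside the data, then use it to ask whether two datasets might overlap. The `BoundingBox` type, the function names, and the coordinates are all made up for illustration, and the sketch ignores complications such as boxes that cross the antimeridian.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundingBox:
    # Decimal degrees; hypothetical field names.
    west: float
    south: float
    east: float
    north: float

    def intersects(self, other):
        """True if the two boxes share any area or touch at an edge."""
        return (
            self.west <= other.east and other.west <= self.east
            and self.south <= other.north and other.south <= self.north
        )


def bounding_box(points):
    """Build a bounding box from (latitude, longitude) pairs.

    Raises ValueError if points is empty.
    """
    lats = [lat for lat, _ in points]
    lons = [lon for _, lon in points]
    return BoundingBox(west=min(lons), south=min(lats), east=max(lons), north=max(lats))


# Made-up coordinates for two datasets and a distant third.
observations = bounding_box([(37.0, -120.0), (38.5, -119.0)])
census_tracts = bounding_box([(38.0, -119.5), (39.0, -118.0)])
far_away = bounding_box([(10.0, 10.0), (11.0, 11.0)])
```

With the box recorded explicitly, a researcher can screen candidate datasets for spatial overlap before loading anything into a GIS.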

We discussed the need for truly globally unique identifiers for datasets: identifiers that are unique not only within organizations but also across them. One benefit is that when different researchers refer to an object, they can be sure they are referring to the same object.
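
One way to see the difference between locally and globally unique identifiers is the Python sketch below. It uses name-based UUIDs (version 5) from the standard library, which is only one possible scheme and not something the group settled on; the organization URLs and the helper name `dataset_uuid` are hypothetical.

```python
import uuid


def dataset_uuid(organization_url, local_id):
    """Derive a repeatable UUID from an organization's URL plus its local dataset ID.

    Illustrative only: the same inputs always give the same UUID, while the
    same local ID under a different organization gives a different one.
    """
    name = f"{organization_url.rstrip('/')}/{local_id}"
    return uuid.uuid5(uuid.NAMESPACE_URL, name)


# Two hypothetical organizations that both have a local dataset "42".
museum_a = dataset_uuid("https://museum-a.example.org/datasets", "42")
museum_b = dataset_uuid("https://museum-b.example.org/datasets", "42")
museum_a_again = dataset_uuid("https://museum-a.example.org/datasets/", "42")
```

Here `museum_a` and `museum_b` differ even though both organizations call their dataset "42", and anyone who repeats the derivation for Museum A arrives at the same identifier.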

During our discussions, we also realized that we need better catalogs of available data, ideally with instructions and notes on how to relate the datasets to each other.

Major Topics Suggested by Participants [gleaned from participant Post-it notes]

On our minds: Machine Learning, Data Visualization, New Applications, Large Datasets, Data Integration, Data Standards, Taxon Concepts, Resource Catalogs, Data Skills and Where Do They Come From, Diversity and Inclusion in Collections, Ontologies for Linking, Computable Trait Data, Data Management for Collections and Citizen Science Data, Open Data, Improving Data Modeling, and Data Quality. May these inspire activities at Digital Data III and beyond.

Content in this blog post contributed collaboratively by the participants. Deborah Paul, Ed.