Data Problems

From iDigBio
Revision as of 10:26, 9 February 2015 by Joanna (talk | contribs) (→‎Anecdotes)

The following are anecdotes contributed by users of iDigBio's data. They are meant to help in two ways:

  1. Anyone submitting data should read them and adjust their data to avoid these issues.
  2. They can be a springboard for interested parties to address the overall data quality issue.

Anecdotes

  • Your Darwin Core archives have all the information we need, but filtering out all the label images will be a challenge. We may have to employ a content-based image retrieval algorithm for the collections that have both label only and organism images, and this may take a while to develop.
    • I was surprised to find a Creative Commons license link in the dcterms:rights field of occurrence files. In the files I looked at (e.g., Recordset 69037495-438d-4dba-bf0f-4878073766f1), there is no dwc:rightsHolder entry in the occurrence file, so it appears that there is a license, but the licensor is not named? If these occurrences really have license restrictions, this complicates things for us. Our data model treats the image + metadata as one media object, and we cannot accommodate different licenses. If the media and occurrence licenses are always the same, it wouldn't be a problem, but in cases where they differ, we could not use the data from the occurrence file. This means descriptions and locality information could not be displayed alongside the image on EOL, and they would not be available through the EOL API, which considerably decreases the value of these images to our users.
    • Also, we would not be able to use label data in TraitBank if the occurrences are licensed. While we recognize licenses at the data set level, we do not implement them at the level of individual records. We have had discussions about this and came to the conclusion that like measurements and facts, occurrence records are unlikely to be protected by copyright, especially when they are presented in a commonly used standard like DwC. Of course, we won't know for sure until somebody files a lawsuit. But we decided to err on the side of openness. Is there any chance this issue could be brought up for discussion at iDigBio?
    • We'll have a little more work to do before we're ready to import any of the iDigBio data. I'll let you know if there is any progress on our end. (K. Schultz, EOL)
  • It seems that institutions have their own "unique" fields that they haven't equated with DwC (or the existing iDigBio fields), so some fields that could probably map to an existing field do not. It would be useful to have a description, from you or the institution, of what each field is, so that the user can merge information from two fields and reduce the number of "variables" (which is what many of the fields are in an analysis). It would also be useful to request a basic formatting standard for any field that contains multiple pieces of information (such as using "|" as a separator), so that it is easier to parse, merge, and standardize.
  • I found the data very difficult to work with for the pilot study on treehoppers. It took me over a week to clean it up, group like information together, and standardize it so it could be used in an analysis; this includes dates, common names, scientific names, and higher taxonomy. And, as Katja mentioned, if you search the portal on family name but the record doesn't have a higher taxonomic designation, you miss all those records, and no one wants to search by hundreds of genus or species names one by one to make sure they are all there. Records should absolutely contain Order, Suborder, Family, Subfamily, Tribe (if appropriate), and genus names.
  • It seems that most people view these data as species page information. However, if you try to use them in an analysis, the format doesn't work well. (C. Johnson, AEC)
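The "|" separator convention suggested in the anecdote above can be sketched as follows. This is only an illustration, not iDigBio code: the function name `split_multivalue` and the sample value are hypothetical, and the sketch assumes the separator never occurs inside an individual value.

```python
def split_multivalue(raw, sep="|"):
    """Split a delimited field into a list of clean values.

    Assumes the separator never appears inside a value
    (a hypothetical convention, not an iDigBio rule).
    """
    if raw is None:
        return []
    # Strip whitespace around each piece and drop empty pieces.
    return [part.strip() for part in raw.split(sep) if part.strip()]


# Hypothetical field content holding several collector names.
print(split_multivalue("A. Smith | B. Jones|C. Lee"))
```

With an agreed separator, a user can split such a field in one step instead of writing a custom parser for each institution's format.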
  • Download format and term definitions
    • The columns in the downloaded file are not in a logical order. All identifier columns should be clustered together, as should locality information, collecting event information, and so on. Within each cluster the data elements can be in a loose order, but they should stay together.
    • Several terms in the download represent the same information but are named only slightly differently (e.g., VerbatimEventDate, verbatimEventDate). These should be merged in the download file, or at least returned next to each other.
    • There is no document that defines the terms. One should be provided. Further, those definitions should have URI identifiers so that individuals can reuse them with confidence (for example, by including them in a meta.xml).
  • Portal behavior
    • When searching the portal, certain fields should not require an exact match. These include the Collector and Locality fields. There are others, but these were the most limiting.
    • Higher taxonomy should be included to improve the search, family name being the most important. If it is not in the dataset from the provider, it should be added automatically upon ingestion into iDigBio. Without the higher taxonomy, a user will miss specimen records they are likely looking for.
  • Minor issues
    • Terms should be evaluated for consistency. For example, the term “row number” contains a space.
    • Ideally, we would like a TSV download as well as a CSV download. (K. Seltmann, R. Rabeler, TTD TCN)
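Until the download itself changes, a user could work around two of the issues above (near-duplicate column names and the lack of a TSV option) on their own side. The following is a minimal sketch under stated assumptions: the helper names `merge_case_variant_columns` and `to_tsv` and the sample data are hypothetical, columns are treated as duplicates only when their names differ by letter case, and the first non-empty value is kept.

```python
import csv
import io


def merge_case_variant_columns(rows):
    """Merge columns whose names differ only by letter case.

    `rows` is a list of dicts, one per record. The first spelling seen
    becomes the canonical column name; the first non-empty value wins.
    """
    canonical = {}  # lowercased name -> first spelling encountered
    merged_rows = []
    for row in rows:
        merged = {}
        for name, value in row.items():
            key = canonical.setdefault(name.lower(), name)
            # Only fill the slot if it is still missing or empty.
            if not merged.get(key):
                merged[key] = value or ""
        merged_rows.append(merged)
    return merged_rows, list(canonical.values())


def to_tsv(rows, fieldnames):
    """Serialize rows as tab-separated text."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, delimiter="\t",
                            lineterminator="\n", restval="")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


# Hypothetical two-record download with a duplicated date column.
csv_text = (
    "id,VerbatimEventDate,verbatimEventDate\n"
    "1,12 May 1998,\n"
    "2,,June 2001\n"
)
rows = list(csv.DictReader(io.StringIO(csv_text)))
merged, columns = merge_case_variant_columns(rows)
print(to_tsv(merged, columns))
```

Keeping the first spelling encountered is an arbitrary choice; mapping names to the official Darwin Core spelling would be preferable where a term list is available.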