Data Problems

The following are anecdotes contributed by users of iDigBio's data and data portal. They aim to be helpful in several ways:

  1. Anyone submitting data should read them and make adjustments and improvements in their own data to avoid the issues found by others.
  2. They can be a springboard for interested parties to address overall data quality issues.
  3. This is also the place to document portal interaction difficulties.

iDigBio interest groups are aware of this documentation and feed it back to their respective groups. Progress and feedback from the developers is also noted here.

Another useful feature to note is iDigBio's data quality flags: common data quality issues and data corrections that may be performed on recordsets to improve the capabilities of iDigBio Search (see https://github.com/iDigBio/idigbio-search-api/wiki/Data-Quality-Flags). Data quality flags are listed for each ingested recordset on its portal webpage.

User Anecdotes

Anecdote Contact
  • Your Darwin Core archives have all the information we need, but filtering out all the label images will be a challenge. We may have to employ a content-based image retrieval algorithm for the collections that have both label only and organism images, and this may take a while to develop.
    • I was surprised to find a Creative Commons license link in the dcterms:rights field of occurrence files. In the files I looked at (e.g., Recordset 69037495-438d-4dba-bf0f-4878073766f1), there is no dwc:rightsHolder entry in the occurrence file, so it appears that there is a license, but the licensor is not named? If these occurrences really have license restrictions, this complicates things for us. Our data model treats the image + metadata as one media object, and we cannot accommodate different licenses. If the media & occurrence licenses are always the same, it wouldn't be a problem, but in cases where they are different, we could not use the data from the occurrence file. This means descriptions and locality information could not be displayed alongside the image on EOL, and they would not be available through the EOL API, which considerably decreases the value of these images to our users.
    • Also, we would not be able to use label data in TraitBank if the occurrences are licensed. While we recognize licenses at the data set level, we do not implement them at the level of individual records. We have had discussions about this and came to the conclusion that like measurements and facts, occurrence records are unlikely to be protected by copyright, especially when they are presented in a commonly used standard like DwC. Of course, we won't know for sure until somebody files a lawsuit. But we decided to err on the side of openness. Is there any chance this issue could be brought up for discussion at iDigBio?
    • We'll have a little more work to do before we're ready to import any of the iDigBio data. I'll let you know if there is any progress on our end.
K. Schultz, EOL (2014) https://www.idigbio.org/redmine/issues/1393
  • It seems that institutions have their own "unique" fields that they haven't equated with DwC (or the existing iDigBio fields), so there are fields that probably could fit an existing field but don't. It might be useful to have a description from you or the institution of what each such field is, so the user can merge info from two fields to reduce the number of "variables" (which many of the fields are in an analysis). It might also be useful to request a basic formatting standard for any field that contains multiple pieces of information (like using "|" as a separator), so it is easier to parse, merge and standardize.
  • I found the data very difficult to work with for the pilot study on treehoppers. It took me over a week to clean it up and put like information together and standardize information so it could be used in an analysis - this includes dates, common names, scientific names, higher taxonomy. And, as Katja mentioned, if you search the portal on family name but the record doesn't have a higher taxonomic designation, you miss all those records and no one wants to search by hundreds of genus or species names one by one to make sure they are all there. Records should absolutely contain Order, Suborder, Family, Subfamily, Tribe (if appropriate) and genus names.
  • It seems that most people view these data as species page information. However, if you try to use it to do an analysis, the format doesn't work well.
C. Johnson, AEC (2/2015) https://www.idigbio.org/redmine/issues/1394
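
The "|" separator suggestion above can be illustrated with a minimal Python sketch. It assumes a record is a plain dictionary and that two fields carry overlapping information; the field name hostPlant is hypothetical, the values are made up, and the helper names are illustrative only, not part of any iDigBio tool.

```python
def split_multivalue(value, separator="|"):
    """Split a delimited field into a list of trimmed, non-empty values."""
    if value is None:
        return []
    return [part.strip() for part in value.split(separator) if part.strip()]


def merge_fields(record, source_fields, separator="|"):
    """Combine several equivalent fields into one ordered, de-duplicated list."""
    merged = []
    for field in source_fields:
        for part in split_multivalue(record.get(field), separator):
            if part not in merged:
                merged.append(part)
    return merged


# Hypothetical record: "hostPlant" stands in for an institution-specific field.
record = {
    "hostPlant": "Quercus alba | Quercus rubra",
    "associatedTaxa": "Quercus rubra|Carya ovata",
}
hosts = merge_fields(record, ["hostPlant", "associatedTaxa"])
print(hosts)
```

With one agreed separator, merging two provider-specific fields into a single analysis variable takes a few lines of code instead of hand cleaning.
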
  • Download format and term definitions
    • The columns after download are not in logical order. All columns that are identifiers should be clustered together, locality information clustered together, collecting event clustered etc. Within the clusters the data elements can be in a loose order, but the elements should be together.
    • Several terms are included in the download that represent the same information, but are named only slightly differently (ex. VerbatimEventDate, verbatimEventDate). These should be merged in the download file or at least returned next to each other in the download file. (fixed in next release of portal)
    • There is no document that defines the terms. One should be provided. Further, those definitions should have URI identifiers so that individuals can reuse them with confidence (including them in a meta.xml).
  • Portal behavior
    • When searching the portal, certain fields should not be an exact match. These include Collector and Locality fields. There are others, but these were the most limiting.
    • Higher taxonomy should be included to improve the search. Family name being the most important. If it is not in the dataset from the provider, it should automatically be added upon ingestion to iDigBio. Without the higher taxonomy, a user will miss specimen records they are likely looking for. (improvements coming in next release of portal - using GBIF Nub taxonomy)
  • Minor issues
    • Terms should be evaluated for continuity. The term “row number” contains a space.
    • Ideally would like a tsv as well as a csv download. (support for tsv export format is coming in next release of portal)
K. Seltmann, R. Rabeler, TTD TCN (2/2015) https://www.idigbio.org/redmine/issues/1395
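
Until near-duplicate terms are merged in the download itself, a user could collapse them locally. The sketch below is one possible approach, not the portal's method: it assumes each downloaded row has been read into a dictionary, treats column names that differ only by letter case as the same term, and keeps the first non-empty value. The sample rows are invented.

```python
def merge_case_variant_columns(rows):
    """Collapse columns whose names differ only by letter case.

    The first spelling encountered becomes the canonical column name, and
    for each row the first non-empty value among the variants is kept.
    """
    canonical = {}  # lowercased name -> first spelling encountered
    merged_rows = []
    for row in rows:
        merged = {}
        for name, value in row.items():
            key = canonical.setdefault(name.lower(), name)
            if not merged.get(key):
                merged[key] = value
        merged_rows.append(merged)
    return merged_rows


# Invented sample rows with the two spellings mentioned above.
rows = [
    {"VerbatimEventDate": "12 May 1998", "verbatimEventDate": "", "collector": "A. Smith"},
    {"VerbatimEventDate": "", "verbatimEventDate": "June 2001", "collector": "B. Jones"},
]
cleaned = merge_case_variant_columns(rows)
print(cleaned)
```
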

One thing we might think of in any attempt to create a national vascular plant portal - how easy it is for the user to acquire data. I had this idea after conducting a simple search for records of one species in each of four portals: iDigBio, SEINet, CPNWH, and the California Consortium.

How many clicks/windows does the user have to navigate to get the results? Ideally it should be minimal. I found in one case (CPNWH: http://www.pnwherbaria.org/data/search.php) it was - after I entered the name and hit "search", I had label data, a map of georeferenced specimens, and any extant images, even records entered under synonyms. I could easily examine label data, sometimes including annotations (!), enlarge images, the map, etc. In other portals, I had to provide additional input to get one or more of the types of data available.

The Pacific Northwest portal allows, as one option, grouping records, incl. sorting by collection and number. I found this quite useful - it's a way to spot specimens that are likely duplicates that may, for various reasons, be filed under different names. Someone doing monographic work can learn of (or examine if they are imaged) specimens they may not otherwise know about - the commenting/Filtered Push system where one could send comments to all owners of the duplicates simultaneously would be a wonderful addition to any such portal.

Brent Mishler replied to my original post and pointed out some of the advantages of the data display in the Consortium of California Herbaria (http://ucjeps.berkeley.edu/consortium/) page.

"The table view used by CCH has advantages in that it can be sorted quickly and easily on any column. CPNWH certainly does have attractive search return pages -- you are looking at images, label data etc. as soon as you search, but note that's only for the first 50 records, as PNW groups into bins of 50, which need to be clicked open separately to be viewed or searched. Also, after you've executed your search, it takes three clicks to sort the results (e.g. by Collector #) in CPNWH while it takes only one click in CCH, so I think the CCH setup ends up being more efficient for certain information needs."

--> To me, one of the things that iDigBio should be concerned about is having the portal be easily usable. If we want it to be the "one stop" for biodiversity data, we need to see what users can get from other portals and provide improvements to that level of info. If it's easier to get the info by using a combination of other sources, folks still might do that. In the examples I sent along on the screen shots, that's what I am trying to show - what we present should be at least as good as what you can get elsewhere. If you compare the results of the "label" view that you get in the iDigBio portal with that in the CPNWH portal, it's clear (at least to me....) that ours is inferior for the reasons that I pointed out. See example.

R. Rabeler, TTD TCN (2/2015) https://www.idigbio.org/redmine/issues/1396

A database search resulted in a rich specimen record dataset (in this case for lichens and bryophytes) for a participant at the IPT workshop. The researcher wants only the distinct taxon names (and a count of specimens per distinct taxon name). The researcher describes that downloading the dataset is then followed by "a lot of work" to get the distinct list of taxon names to share with a colleague.

  • How do we link to instructions for how to do this sort of task, either a) through our existing UI if possible, or b) to an API example? In this use case, Alex and Matt verified that it is already possible to do this (sort of) through our existing API.
Mac Alford, (2015), entered here by Deb.
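
For a dataset that has already been downloaded, the distinct-name list can also be produced locally. The sketch below is a minimal example using only the Python standard library. It is not the API route mentioned above, and the column name dwc:scientificName is an assumption about the download's header that should be checked against the actual file. The sample data is invented.

```python
import csv
import io
from collections import Counter


def count_taxa(csv_file, name_column="dwc:scientificName"):
    """Count specimen records per distinct taxon name in a CSV download.

    name_column is an assumption; adjust it to match the file's header.
    """
    counts = Counter()
    for row in csv.DictReader(csv_file):
        # Collapse stray whitespace so " Sphagnum  fuscum" matches "Sphagnum fuscum".
        name = " ".join((row.get(name_column) or "").split())
        if name:
            counts[name] += 1
    return counts


# Invented sample standing in for a downloaded occurrence file.
sample = io.StringIO(
    "dwc:scientificName,dwc:catalogNumber\n"
    "Cladonia rangiferina,1\n"
    "Sphagnum fuscum,2\n"
    "Cladonia rangiferina,3\n"
    ",4\n"
)
counts = count_taxa(sample)
for name, n in sorted(counts.items()):
    print(f"{name}\t{n}")
```

For a real file, pass an open file handle instead, for example open("occurrence.csv", newline="", encoding="utf-8"), where the filename is again an assumption about the download.
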
Collections - It would be great if collections could get a sense of their collection's uniqueness – what do they have in their collections that no other collection in the portal has – either taxonomically or geographically. It would be great if you could get a uniqueness factor and display a summary of the unique aspects of the collection. Maybe even include preparation information – do I have tissues that nobody else has of a particular species or from a particular area? This would be very handy in grant proposals and in justifying the existence of particularly small collections. Andy Bentley (3/2015)
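
One simple reading of taxonomic "uniqueness" is the set of taxa held by exactly one collection. The sketch below illustrates that reading only, not an existing portal feature: it assumes records reduced to (collection, taxon) pairs, and the collection codes COLL_A and COLL_B are hypothetical.

```python
from collections import defaultdict


def unique_taxa_by_collection(records):
    """Return, per collection, the taxa that no other collection holds."""
    holders = defaultdict(set)  # taxon -> collections holding it
    for collection, taxon in records:
        holders[taxon].add(collection)

    unique = defaultdict(set)
    for taxon, collections in holders.items():
        if len(collections) == 1:
            (only,) = collections
            unique[only].add(taxon)
    return dict(unique)


# Hypothetical holdings for two made-up collection codes.
records = [
    ("COLL_A", "Etheostoma cragini"),
    ("COLL_A", "Lepomis cyanellus"),
    ("COLL_B", "Lepomis cyanellus"),
    ("COLL_B", "Elassoma okefenokee"),
]
unique = unique_taxa_by_collection(records)
print(unique)
```

The same grouping would work for geographic or preparation-type uniqueness by replacing the taxon with, say, a (taxon, county) or (taxon, preparation) pair.
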
Researchers – It would be very handy if researchers could subscribe to a portal in order to get an alert when specimens of target species or geographic regions are added to the portal. For research purposes, this would alert them to new material that may warrant their inspection and would facilitate loan traffic from collections that have newly cataloged material. Andy Bentley (3/2015)