Dr. Corinna Gries is PI and head of the North American Lichens and Bryophytes Thematic Collections Network. An accomplished researcher and programmer, she is interviewed here by Jill Holliday and discusses some of the history of specimen databasing, the goal of the North American Lichens and Bryophytes TCN, and the importance of public participation and crowd-sourcing to the TCN databasing projects.
Holliday: Corinna, you are the head of the North American Lichens and Bryophytes TCN.
Gries: Yes, and I have great team members in this endeavor: Tom Nash is the lichenologist on the team and Ed Gilbert is the main IT person.
Holliday: How many specimens does that cover?
Gries: About 2.3 million specimens distributed across 65 institutions.
Holliday: Where are you based?
Gries: At the University of Wisconsin-Madison. I am at the Center for Limnology, where I am also the information manager for the NSF-funded North Temperate Lakes Long-Term Ecological Research (NTL-LTER) site.
Holliday: What does your current work entail?
Gries: Information management means archiving data. I usually describe this as running a museum for data, only it’s all virtual. And because this is a relatively new field, it requires a fair amount of custom programming, developing effective procedures using existing software, and training of users. In more specific terms, we are trying to provide a framework for scientists to discover, integrate, and analyze data in new and innovative ways. At the same time, we are preserving environmental data and information for future generations. Obviously, the latter is the main goal of natural history collections as well. In this ADBC digitization project, we will make those data available to a broader audience so that they may contribute to generating new knowledge in the area of biodiversity research.
Holliday: But you haven’t always been a programmer or information scientist. What did you originally set out to do?
Gries: I grew up in Germany and obtained my Ph.D. in Botany at the University of Kiel in 1988. I was a researcher in botanical eco-physiology for many years. I was exposed to lichen herbariums when I worked as a post-doc in Arizona studying metabolic responses of lichens to air pollution. I started programming during that time and I developed an early herbarium management system about twenty years ago.
Holliday: That was back in the early 90s? How were the processes different then?
Gries: Data structures and ideas for what we do now were being developed in the community even then. I can remember an early meeting where everyone was talking about the problem of changing geographic names and political borders and how to resolve them. The solution, of course, was to use latitude/longitude, but that was long before GPS and you still had to go to a library and look at old maps and try to sort it out based on sparse information on labels that may be a couple hundred years old.
Holliday: Even before you became involved with the TCN, you were part of the Consortium of North American Bryophyte Herbaria and the Consortium of North American Lichen Herbaria. Both of those are much older than the current TCNs. Tell me how those got started.
Gries: The software (called Symbiota) that powers the consortium portals has been developed over the last 15 years with several NSF grants, and the communities using and managing the portals have developed along with the improving software. Many people recognized the utility of a database for their collections early on; once those databases started growing, the next logical step was to publish the information on the web. That is what Symbiota does for portal participants. Symbiota is a content management system for biodiversity information; for techies, it is like Drupal for biodiversity. So it is essentially a programming framework that makes it simple to get content on the web and organize it. It is best suited for groups of biodiversity collections that want to organize their information regionally or taxonomically.
Holliday: How is that different from the TCN?
Gries: Our TCN is focused on optimizing digitization of lichen and bryophyte collections information and not on programming a web application.
Holliday: How long do you expect this project to take?
Gries: Well, the NSF ADBC grant period is four years and we have estimated that in this timeframe we can digitize the 2.3 million specimens from Mexico, the U.S., and Canada that are held in the U.S. That includes an image of the label and transcription of the label information, but very few images of the actual specimens. Specimen images will mostly be examples: unlike many vascular plants, lichens and bryophytes can rarely be identified from a photograph; you need microscopy and, in many cases, chemical analyses.
Holliday: The specimens aren’t your focus, then. You are working to image the labels themselves? What are the biggest challenges you are facing right now?
Gries: So far it is going as expected: there are no major hurdles, and we reasonably expect to accomplish what we set out to do. My main concern is that there won’t be enough volunteers willing to help with the manual typing (transcription) work. Our project will rely to a certain degree on crowd-sourcing, that is, involving the general public in the process of transcribing the label information into a digital format or database.
Holliday: So there is a role for crowd-sourcing and citizen scientists?
Gries: Yes, this brings us back to the Symbiota software. It is set up to allow volunteers to obtain an account with one of the portals and enter data. The images of labels can be viewed in the application and the information from the label entered into a form. The labels are routinely run through a program that automatically tries to read as many of the words on the label as possible. This is called optical character recognition (OCR). But this OCR, even as optimized as we have it now, can only read so much. Handwriting and old typewriter fonts are a problem. It is actually kind of interesting to look at these labels and try to guess what has happened to them. There are many scribbled notes, funny typos, and other interesting things on them. Therefore, all automation must be double-checked by humans.
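Editor's sketch: the "OCR first, human check afterwards" workflow described above could be modeled roughly as follows. This is only an illustration and not Symbiota's actual data model; the class name `LabelTranscription`, its fields, the `confirm` helper, and the sample label text are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LabelTranscription:
    """One label image plus its transcription state (hypothetical model)."""
    image_id: str                        # made-up identifier for the label image
    ocr_text: str                        # raw text produced by the OCR pass
    verified_text: Optional[str] = None  # filled in only after a human check

    @property
    def needs_review(self) -> bool:
        # Every OCR result stays in the review queue until a person confirms it.
        return self.verified_text is None


def confirm(record: LabelTranscription, corrected_text: str) -> LabelTranscription:
    """Store the human-checked text, which may differ from the OCR output."""
    record.verified_text = corrected_text.strip()
    return record


# Invented sample: the OCR pass misread one letter of the species name.
record = LabelTranscription("img-001", "Cladonia rangiferlna")
confirm(record, "Cladonia rangiferina")
```

The raw OCR text is kept alongside the verified text so that a volunteer's correction never overwrites what the machine originally read.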
Holliday: So how does this work then? A person gains access and then what?
Gries: Once a person has access to the data editing functions in the bryophyte or lichen portal, he or she is taken to our data entry screen. This screen has fields for all the information on a specimen label, i.e., genus, species names, collector, collection date, locality, latitude and longitude, etc. Most of these fields offer some help with entry, e.g., genus and species names are provided in ‘pull-down’ menus. We use a program called GeoLocate, which was developed at Tulane University. Once invoked on a locality description that includes some geographic name, like a town, or road names, it will try to find the best matches on the map. The user is then presented with a map showing several possible points, and he or she can choose which one is correct. We are especially hoping that volunteers who know a certain area well will help with geo-referencing specimens from that area.
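Editor's sketch: the data entry fields and the "pick the right point" step described above might look something like this in code. The field names follow the ones listed in the interview, but the `SpecimenRecord` class, the `apply_georeference` helper, and the sample values are hypothetical. The candidate coordinates are hard-coded placeholders standing in for whatever a georeferencing tool suggests; no real GeoLocate API is called here.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class SpecimenRecord:
    """Fields a volunteer fills in from a specimen label (hypothetical model)."""
    genus: str
    species: str
    collector: str
    collection_date: str
    locality: str
    latitude: Optional[float] = None   # unknown until georeferenced
    longitude: Optional[float] = None


def apply_georeference(
    record: SpecimenRecord,
    candidates: List[Tuple[float, float]],
    chosen_index: int,
) -> SpecimenRecord:
    """Copy the candidate point the volunteer picked onto the record."""
    lat, lon = candidates[chosen_index]
    if not (-90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0):
        raise ValueError("coordinates out of range")
    record.latitude, record.longitude = lat, lon
    return record


# Invented example values, for illustration only.
specimen = SpecimenRecord(
    genus="Cladonia",
    species="rangiferina",
    collector="A. Collector",
    collection_date="1975-06-01",
    locality="near Flagstaff, Arizona",
)
candidate_points = [(35.20, -111.65), (35.25, -111.60)]  # placeholder suggestions
apply_georeference(specimen, candidate_points, chosen_index=0)
```

Leaving latitude and longitude empty until a person chooses a point mirrors the interview's emphasis on local knowledge: the tool only proposes matches, and the volunteer decides.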
UW article summarizing the project: http://www.news.wisc.edu/19609
Project website: http://lbcc.limnology.wisc.edu/
Lichen consortium website: http://lichenportal.org
Bryophyte consortium website: http://bryophyteportal.org
Symbiota software website: http://symbiota.org