iDigBio Webinar: Visualize Your Text Data Using OCR Output

by Deb Paul and Andrea Matsunaga for: Jason Best, Sylvia Orli, William Ulate, Miao Chen, and Reed Beaman

Recorded Webinar: Visualize Your Text Data Using OCR Output
     00:00-15:23 Overview (Deb Paul)
     15:24-29:13 Live DEMOs (Andrea Matsunaga and Jason Best)
     29:14-55:57 Discussion

More than 30 people attended iDigBio's Webinar titled Visualize Your Text Data Using OCR Output to hear more about using the text output from Optical Character Recognition (OCR) software to enhance and improve the digitization process. Anyone involved in capturing, cleaning, validating, and georeferencing specimen data can see the potential of the ideas for improving data quality and the transcription experience.

Recent research indicates the techniques demonstrated in this webinar also increase user happiness (Drinkwater et al., 2014). How? When a new specimen data record has only a few fields filled in (maybe just the bar code and an image!), users can search the OCR text to create their own datasets, choosing records that match their expertise or personal interests. They may be able to read a particular foreign language or a certain person's handwriting. Or, if they are georeferencing, they can use these methods to create a record set for a familiar area. Collection managers can use these methods too, to create datasets for researchers.
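As a minimal sketch of that kind of search, assuming the OCR text has already been indexed into a local Apache Solr instance (the core name "ocr_labels" and the field names below are hypothetical, not from the webinar):

```python
# Sketch: build a personal record set by searching indexed OCR text.
# Assumes a local Solr core named "ocr_labels" with "id" and "ocr_text"
# fields; those names are hypothetical.
import requests

SOLR_SELECT = "http://localhost:8983/solr/ocr_labels/select"

def find_records(term, rows=50):
    """Return ids of records whose raw OCR text mentions `term`."""
    params = {"q": f'ocr_text:"{term}"', "fl": "id", "rows": rows, "wt": "json"}
    resp = requests.get(SOLR_SELECT, params=params)
    resp.raise_for_status()
    return [doc["id"] for doc in resp.json()["response"]["docs"]]

# A georeferencer familiar with one area can pull just those labels:
print(find_records("Okaloosa County"))
```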

The iDigBio Augmenting Optical Character Recognition Working Group (aOCR WG) has been spreading the word not only about improvements in algorithms for parsing OCR output, but also about other important uses for that output, such as the visualization techniques described in this webinar. At the iDigBio #Citscribe Transcription Hackathon, Sylvia Orli (Smithsonian) and Miao Chen (Indiana University, Data to Insight) from the broader community, Andrea Matsunaga (iDigBio) from the Cyberinfrastructure Working Group (CYWG), and aOCR WG members Jason Best (Botanical Research Institute of Texas, BRIT), William Ulate (Biodiversity Heritage Library, BHL), and Deborah Paul (iDigBio) formed one of four teams.

Background.

At TDWG 2013, Elspeth Haston (aOCR WG member) demonstrated how she and her colleagues at the Royal Botanic Garden Edinburgh (RBGE) used word clouds, created from OCR of herbarium labels, to reveal otherwise hidden but very useful search terms. The focus of the recent iDigBio #Citscribe Hackathon was to add functionality to existing crowd-sourcing tools used in the transcription of museum specimen data from both labels and notebooks. Using existing OCR output from lichen, herbarium, and entomology labels and a new dataset from the Biodiversity Volunteer Portal (thanks, Sylvia Orli), the OCR Integration Track (LlLl) Team set out to build on the RBGE word cloud idea. We demonstrated how existing open source software can be used to see the OCR output in new ways and link to the data records (and images) themselves.
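As an illustration of the word cloud idea (not RBGE's exact code), the open source Python wordcloud package can turn a file of raw OCR label output into a cloud in a few lines; the input file name below is hypothetical:

```python
# Sketch of an RBGE-style word cloud from concatenated OCR label output.
# Assumes the "wordcloud" package (pip install wordcloud); the input
# file name is hypothetical.
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS

with open("herbarium_ocr_output.txt", encoding="utf-8") as f:
    ocr_text = f.read()

# Common stopwords are dropped so rarer, search-worthy terms stand out.
cloud = WordCloud(width=800, height=400, stopwords=STOPWORDS,
                  background_color="white").generate(ocr_text)

plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()
```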

Webinar Recording: Visualize Text Data Using OCR Output
Webinar Powerpoint (as pdf): Enhancing Crowdsourcing using Text Analytics and Visualization
Related Blog post: CITSCribe Hackathon
Related iDigBio CITSCribe wiki
Related Blog post: While Standing on the Shoulders of Giants by William Ulate (BHL)
Related paper: Drinkwater R, Cubey R, Haston E (2014) The use of Optical Character Recognition (OCR) in the digitisation of herbarium specimen labels. PhytoKeys 38: 15-30
More OCR Work of interest: Hacking OCR for pro-iBiosphere (March 2014) by David P. Shorthouse, Rod Page, Kevin Richards, Marko Tähtinen

Details.

The LlLl team used Apache Solr to index the OCR text output from the lichen (and other) datasets, and also indexed crowdsourced records from the BVP system in order to apply the same concept to volunteer-transcribed records that need validation. The same OCR text was processed with an n-gram based Natural Language Processing (NLP) algorithm that measures the confidence of each word produced by OCR, using human-validated specimen data as the training set. Once the text was indexed and a confidence score was attached to each word, Google Charts and Carrot2 were used to provide different visualizations and access to the confidence data. The Google Charts visualization makes it easy to find words by frequency, and therefore to spot rare words, while Carrot2 presents three different views of the data (a folder-style tree, a circle view, and a FoamTree view) and offers several algorithms for clustering the indexed terms (Lingo, STC, and k-means). The resulting clusters are nested: clicking on any cluster reveals the terms found within that set. BHL is planning to use this idea in Purposeful Gaming (along with code from Ben Brumfield of FromThePage.com) to recognize "junk OCR" and sort labels into those good for automated parsing and those that need humans.
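A minimal sketch of the n-gram confidence idea follows. It is not the team's exact algorithm, but it shows the principle: words whose character n-grams rarely occur in human-validated label text get a low score, which is also how "junk OCR" can be flagged. The training file name is hypothetical.

```python
# Sketch of n-gram confidence scoring (not the team's exact algorithm):
# score each OCR word by how familiar its character trigrams are, with
# human-validated label text as the training corpus.
from collections import Counter

def char_trigrams(word):
    padded = f"^{word.lower()}$"  # mark word boundaries
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

# Train on words from human-validated transcriptions (hypothetical file).
with open("validated_labels.txt", encoding="utf-8") as f:
    trained = Counter(g for w in f.read().split() for g in char_trigrams(w))

def confidence(word):
    """Share of the word's trigrams seen in validated text; low = likely junk."""
    grams = char_trigrams(word)
    return sum(1 for g in grams if trained[g] > 0) / len(grams) if grams else 0.0

for w in ("Florida", "F1or!da", "xq#7"):
    print(w, round(confidence(w), 2))  # real words score high, junk scores low
```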

We see changes in the community's ability to harness these strategies and look forward to seeing even broader use. Symbiota sites that have chosen to implement OCR can combine the OCR output with machine learning (ML) and NLP to facilitate automated data entry, and as a result users can create their own datasets for transcription, validation, and georeferencing. Specify users who run OCR can likewise search the OCR output for a record or records. The broader community and the iDigBio aOCR Working Group are sharing software and ideas to enhance their data transcription, data enhancement, and data visualization strategies. We are nurturing innovation step by step, and work is underway that will make it easier to share and use OCR output; see the blog post BIOSPEX—A Crowdsourcing Management System.

Want to try what you see in this webinar with your own text data? See the step-by-step instructions for Apache Solr indexing and Carrot2 by Andrea Matsunaga.
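For orientation only (Andrea's instructions are the authoritative guide), here is a sketch of the indexing step: posting OCR label text to a running Solr instance through its JSON update handler. The core name, field names, and document values are hypothetical and must match your own Solr schema.

```python
# Sketch of the Solr indexing step. Assumes a running Solr with a
# hypothetical core named "ocr_labels"; document ids and text are
# made-up examples.
import requests

UPDATE_URL = "http://localhost:8983/solr/ocr_labels/update?commit=true"

docs = [
    {"id": "BRIT-0001", "ocr_text": "PLANTS OF TEXAS Quercus stellata Wangenh."},
    {"id": "BRIT-0002", "ocr_text": "Herbarium of the Botanical Research Institute of Texas"},
]

# Solr's JSON update handler accepts a list of documents directly.
resp = requests.post(UPDATE_URL, json=docs)
resp.raise_for_status()
print("Indexed", len(docs), "documents")
```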

Ongoing discussions after this webinar include:

  • Integration of these functions in crowd-sourcing tools.
  • Users are asking if they can export OCR output from various software tools (Symbiota, Specify, etc.).
  • Multi-keyed records are great only if we can get a consensus record!
  • Not all crowd-sourcing tools use multi-keying, so they skip the need for a consensus strategy (the ALA BVP, for example).
  • Others are interested in the further development of the confidence-scoring algorithms, and BHL is planning to implement this in their crowd-sourcing tools.
  • Parsing of OCR output can be used to improve the OCR (ask Bryan Heidorn how)!
  • This blog post is based on these slides and the recording of the webinar.
  • Please contact us if you have questions or want to work on some of this with us in the aOCR working group.