SPNHC 2014: Progress in Digitization: Using optical character recognition (OCR) output in digitization

Title: SPNHC 2014: Progress in Digitization: Using optical character recognition (OCR) output in digitization
Publication Type: Presentation
Year of Publication: 2014
Authors: Paul, Deborah L.; Matsunaga, Andrea; Chen, Miao; Best, Jason; Orli, Sylvia; Haston, Elspeth M.
Keywords: confidence scores, data visualization, digitization, Machine Learning, Natural Language Processing, optical character recognition, SPNHC 2014, SPNHC 2014: Progress in Digitization
Abstract: Before iDigBio (Integrated Digitized Biocollections) existed, others had begun working out how to use OCR output with machine learning (ML) and natural language processing (NLP) to improve the efficiency and speed with which data from museum specimen label images can be captured and validated. The Augmenting Optical Character Recognition Working Group (aOCR WG) at iDigBio is pleased to be building on that foundation. Improvements have been realized in parsing algorithms and in the visualization of data. Recently, researchers at the Royal Botanic Garden Edinburgh (RBGE) successfully used word clouds generated from OCR output to reveal useful data that would otherwise remain dark until a specimen label was completely digitized (Haston et al., TDWG 2013). Their work indicates that this method results in greater transcriber job satisfaction. Inspired by this work, the OCR Integration Track Team at the iDigBio Citscribe Hackathon showed how indexing, scoring, and visualizing OCR output reveals otherwise hidden search terms, uncovers errors, and can improve the experience of data transcribers and data validators. Using open-source software, we presented these ideas to groups with transcription tools already up and running, including Notes From Nature, the Biodiversity Heritage Library (BHL), the ALA Biodiversity Volunteer Portal, Smithsonian Digital Volunteers, and the Lichen, Bryophytes and Climate Change Volunteer Portal, which uses Symbiota software. The aOCR WG is working collaboratively with the Joint Research Activity (JRA) of the Synthesys3 Project to share expertise on automated data collection from digital images. Got text in your images? How might OCR output work for you? Come talk to our aOCR WG to find out and to share your expertise.