OCR Resources

From iDigBio
Revision as of 13:33, 25 August 2014 by Dpaul (Talk | contribs)


OCR Software used by ADBC projects

  • ABBYY FineReader - High-performing proprietary OCR software from the ABBYY software company. The Professional and Corporate Editions are designed specifically for Microsoft Windows operating systems.
  • GOCR (or JOCR) - Free optical character recognition program, initially written by Jörg Schulenburg. It converts scanned image files (portable pixmap or PCX) into text files.
  • OCRopus - Free document analysis and optical character recognition (OCR) system released under the Apache License, Version 2.0. It has a highly modular design based on plugins.
  • Tesseract - Open source optical character recognition engine available under the Apache License, Version 2.0. The software runs on various operating systems and is considered one of the more accurate OCR engines available under a free software license.
  • Xerox OCR engine -
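
As a rough illustration of how a command-line engine such as Tesseract can be scripted, the following Python sketch builds and runs the basic `tesseract IMAGE OUTPUTBASE -l LANG` invocation. It assumes the `tesseract` executable is installed and on the PATH. The function names and the file name `label.png` are hypothetical and are not taken from any ADBC project's workflow.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, output_base, lang="eng"):
    """Return the argument list for: tesseract IMAGE OUTPUTBASE -l LANG."""
    return ["tesseract", str(image_path), str(output_base), "-l", lang]


def ocr_label(image_path, output_base, lang="eng"):
    """Run Tesseract on one label image and return the recognized text.

    Assumes the tesseract executable is installed; raises if it is not.
    """
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    subprocess.run(
        build_tesseract_command(image_path, output_base, lang),
        check=True,  # raise CalledProcessError if Tesseract fails
    )
    # Tesseract appends ".txt" to the output base for plain-text output.
    return Path(f"{output_base}.txt").read_text(encoding="utf-8")


# Hypothetical usage (requires Tesseract and an image named label.png):
#   text = ocr_label("label.png", "label_ocr")
#   print(text)
```

Raw OCR output from specimen labels usually still needs parsing into fields, which is the step the tools listed further down this page address.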

Webinars / Demos

Biodiversity Informatics Tools Incorporating OCR Technology

  • Apiary Project - High-throughput workflow for computer-assisted human parsing of biological specimen label data
  • HerbIS - (Erudite Recorded Botanical Information Synthesizer) - Software algorithms that process herbarium label data and present it in a machine-understandable format through the use of natural language processing (NLP). Created at the Yale Peabody Museum of Natural History.
  • Symbiota - Specimen-based virtual flora/fauna software with a built-in module for specimen digitization that incorporates OCR technology
  • SALIX - Semi-automatic Label Information eXtraction system, designed to capture herbarium specimen label data using optical character recognition technologies and transfer those data into a database.
  • ScioTR - A new touch-enabled Windows 8 app that integrates Optical Character Recognition (OCR), Natural Language Parsing (NLP) and Machine Learning (ML) to provide an efficient workflow for capturing highly structured data from images. ScioTR lets the user parse or excavate the image for regions of interest using a touch screen interface. Selecting regions this way makes the OCR, NLP and ML strategies more effective, so less human interaction is required later in the workflow. ScioTR works best when used in concert with a commercially available OCR engine. It includes some NLP and ML modules of its own and allows a custom field set to be configured. ScioTR was presented at the SPNHC DemoCamp in 2013. The PowerPoint given at the conference and a video of the demo are available here: ScioTR.com. More technical information is available on our software development blog, ScioChronicle, and on the ScioQualis YouTube Channel. We hope to have ScioTR in the Windows 8 store around the end of Jan 2014. For a good outline of the current features and user experience, see the ScioTR Help Documentation, which is still being compiled.

Coding Outcomes from the aOCR Hackathon (Feb 2013)

Sample Images

Museum Specimen Label Examples
