Presentations & Reports

From iDigBio

Talks & Reports from Hackathon 1


Hackathon Overview & Intro to iDigBio - Deborah Paul
Overview of the Hackathon goals and introduction to iDigBio for those new to the project. Review this presentation before the others to learn about the aOCR working group and the talks and work being done at this hackathon and afterward.
Hackathon Metrics - Alex Thompson
Parsing Dataset 1 using SALIX 2 - Daryl Lafferty
SALIX (“Semi-Automatic Label Information eXtraction”) is a parsing system developed and used extensively at Arizona State University. Its purpose is to parse OCR'd label data into the respective data fields (e.g., collector, collection number). The original SALIX required user intervention with each label to format and proofread. SALIX 2 tries to remove the “Semi” and make the process fully automatic. It is written in C++ for Windows, and development focused on lichen labels.
Improving OCR Inputs from OCR Outputs - Ben Brumfield
Efforts to improve the quality of OCR by pre-processing images based on the output of 'naive' OCR execution. Topics included handwriting detection within Dataset 1 (final report) and label extraction from Dataset 3 (final report).
Image Segmentation - Phuc Nguyen
Parsing Dataset 1: Regular Expression-Based Parsing of Tesseract Output from Lichen Herbarium Labels - Robert Anglin
The Lichens, Bryophytes and Climate Change (LBCC) project endeavors to digitize the label information from millions of North American lichen and bryophyte herbarium specimens. These labels are occasionally handwritten, although usually at least partially typed or printed, and the typed or printed ones use a myriad of fonts. Thus far, we have been constrained to use open-source Optical Character Recognition (OCR) software. By most accounts the best of these is Tesseract: it does not recognize handwriting, but it is considered the best at recognizing typewritten text in images. Our workflow involves attaching a barcode to each label, imaging it, using the barcode as the image filename, and submitting the images to an FTP server to be entered into Symbiota, a MySQL database designed by Ed Gilbert, with the barcode as the catalog number. Once entered into Symbiota, the images are batch processed with Tesseract and the results are entered into the database as well. The next step is to parse the Tesseract output to retrieve the label information, in the hope of populating the relevant fields in the database. I have been developing code for this purpose using PHP version 5.3 and its PCRE regular-expression library.
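As a rough illustration of regular-expression-based field extraction from OCR text, the sketch below uses Python's re module instead of the PHP/PCRE code described in the talk. The patterns, the field names, and the parse_label function are hypothetical and are not taken from the LBCC code; real labels vary far more than these two patterns allow.

```python
import re

# Hypothetical patterns for illustration only; not the LBCC project's expressions.
# Matches e.g. "Coll. J. W. Thomson No. 1234" or "Leg. A. Smith #56".
COLLECTOR_RE = re.compile(
    r"(?:Coll(?:ector|\.)?|Leg\.)\s*:?\s*"
    r"(?P<collector>[A-Z][\w.\- ]+?)\s+"
    r"(?:No\.?|#)\s*(?P<number>\d+[A-Za-z]?)"
)
# Matches dates written as "12 July 1967" or "3 Aug. 1902".
DATE_RE = re.compile(
    r"(?P<day>\d{1,2})\s+(?P<month>[A-Z][a-z]{2,8})\.?\s+(?P<year>(?:18|19|20)\d{2})"
)


def parse_label(ocr_text):
    """Return a dict of the fields that could be recognized in the OCR text."""
    fields = {}
    match = COLLECTOR_RE.search(ocr_text)
    if match:
        fields["collector"] = match.group("collector").strip()
        fields["collector_number"] = match.group("number")
    match = DATE_RE.search(ocr_text)
    if match:
        fields["date"] = "{} {} {}".format(
            match.group("day"), match.group("month"), match.group("year")
        )
    return fields


sample = (
    "Cladonia rangiferina (L.) Weber\n"
    "Alaska, near Fairbanks\n"
    "Coll. J. W. Thomson No. 1234   12 July 1967"
)
result = parse_label(sample)
```

Fields that no pattern recognizes are simply left out of the result, so a downstream step could leave the corresponding database columns empty for manual review.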
LABELX - Bryan Heidorn & Qianjin Zhang

Label Annotation through Biodiversity Enhanced Learning (LABELX) (Heidorn & Wei, 2008; Heidorn & Zhang, 2013) is a Hidden Markov Model (HMM)-based system (Frasconi et al., 2003) with a number of preprocessing and post-processing algorithms added. HMMs exploit the order of elements in a label and Bayesian conditional probability to predict the proper classification of elements of the label. Because there are millions of scientific names, and any individual name is unlikely to appear in the training set of 100-200 records used here, for plants the system uses the International Plant Name Index (IPNI, 2012) of scientific names and authorities to substitute generic tokens for genus, species, and authority. Similar substitutions are made for integers and alphanumerics that are used in collector numbers and collection numbers.
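The generic-token substitution step can be sketched as follows. This is a minimal illustration, not LABELX code: the small name sets stand in for an IPNI lookup, and the placeholder strings and the generalize function are hypothetical.

```python
import re

# Tiny stand-ins for an IPNI lookup; a real system would query the full index.
GENERA = {"Cladonia", "Quercus"}
EPITHETS = {"rangiferina", "alba"}
AUTHORITIES = {"L.", "Weber"}

# A token made only of letters and digits, with at least one of each.
ALNUM_RE = re.compile(r"(?=.*\d)(?=.*[A-Za-z])[A-Za-z\d]+")


def generalize(tokens):
    """Replace names and numbers with generic tokens before HMM training or tagging."""
    out = []
    for tok in tokens:
        if tok in GENERA:
            out.append("<GENUS>")
        elif tok in EPITHETS:
            out.append("<SPECIES>")
        elif tok in AUTHORITIES:
            out.append("<AUTHORITY>")
        elif re.fullmatch(r"\d+", tok):
            out.append("<INT>")
        elif ALNUM_RE.fullmatch(tok):
            out.append("<ALNUM>")
        else:
            out.append(tok)  # leave all other tokens unchanged
    return out


generalized = generalize("Quercus alba L. Coll. Smith 1234 A12".split())
```

With this substitution the model only needs to learn how the placeholders behave in sequence, so a name missing from the small training set can still be classified as long as it is found in the name index.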

Parsing Dataset 2 - Dmitry Mozzherin
iDigBio OCR Hackathon Initial Results - Rest API
Services and Workflow UIs - Robin and Paul Schroeder
Workflows - John Pickering
DarwinScore and Apiary - Jason Best

Back to Hackathon Wiki