Five iConference2013 Talks

::::::The presentation will begin with a brief overview of the LBCC project and a simple workflow diagram, then cover the OCR/NLP components of the workflow in greater detail. The Symbiota user interface will be shown, with a short explanation of how it will be used to digitize label metadata. Current limitations and challenges of OCR and NLP processing will be explained as they pertain to this project, and several solutions the project is exploring will be described in detail along with their preliminary results.


===[https://www.idigbio.org/sites/default/files/workshop-presentations/aocr-hackathon/LABELX.pdf HERBIS/LABELX] -- Machine Learning Approach to Parsing OCR Text===
:::;Bryan Heidorn: A central goal of digitizing museum specimen labels is to parse their content into standard fields such as those defined in the Darwin Core (DwC). Museum specimen labels number in the billions and range from hundreds of years old to newly created. Subsets of labels share some minimal layout consistency, but overall layout variability is very high, and the text that OCR generates from these labels tends to have a high error rate. This combination makes it extremely difficult for programmers to write rule systems or other matching algorithms that classify the sub-elements of a label into DwC fields. Statistical methods such as Naive Bayes, Hidden Markov Models, and n-grams can be combined with human supervision and authority files to successfully parse OCR output into the proper fields and to correct some OCR errors based on context.
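
To make the statistical idea concrete, here is a minimal sketch of one of the methods named in the abstract: a Naive Bayes classifier over character trigrams that assigns a line of OCR text to a DwC field. It is not the HERBIS/LABELX implementation. The class name, the `features` helper, and the tiny training set are all assumptions invented for illustration; a real system would train on many human-labeled lines and would likely add authority files and sequence models on top.

```python
import math
from collections import Counter, defaultdict


def features(line):
    """Split a lowercased, padded line into character trigrams.

    Trigrams degrade gracefully under OCR noise: one wrong character
    only corrupts the few trigrams that overlap it.
    """
    text = f"  {line.lower()} "
    return [text[i:i + 3] for i in range(len(text) - 2)]


class NaiveBayesFieldClassifier:
    """Hypothetical multinomial Naive Bayes with add-alpha smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.field_counts = Counter()            # training lines per field
        self.gram_counts = defaultdict(Counter)  # trigram counts per field
        self.vocab = set()                       # all trigrams seen in training

    def train(self, examples):
        for line, field in examples:
            self.field_counts[field] += 1
            for gram in features(line):
                self.gram_counts[field][gram] += 1
                self.vocab.add(gram)

    def predict(self, line):
        total_lines = sum(self.field_counts.values())
        vocab_size = len(self.vocab)
        best_field, best_score = None, -math.inf
        for field, n_lines in self.field_counts.items():
            # log prior + sum of smoothed log likelihoods
            score = math.log(n_lines / total_lines)
            denom = sum(self.gram_counts[field].values()) + self.alpha * vocab_size
            for gram in features(line):
                score += math.log((self.gram_counts[field][gram] + self.alpha) / denom)
            if score > best_score:
                best_field, best_score = field, score
        return best_field


# Invented training lines (assumption: real data would be human-labeled OCR output).
TRAINING = [
    ("Quercus alba L.", "scientificName"),
    ("Pinus strobus L.", "scientificName"),
    ("Acer rubrum L.", "scientificName"),
    ("Coll. J. Smith", "recordedBy"),
    ("Collected by A. Gray", "recordedBy"),
    ("Coll. M. Jones", "recordedBy"),
    ("12 June 1902", "eventDate"),
    ("3 May 1887", "eventDate"),
    ("21 July 1910", "eventDate"),
]

clf = NaiveBayesFieldClassifier()
clf.train(TRAINING)
```

Calling `clf.predict("Col1. J. Smlth")` scores a line containing two simulated OCR errors against each field. Character trigrams are used instead of whole words so that a misread character does not erase all evidence from the token it sits in.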

