Five iConference2013 Talks: Difference between revisions

Revision as of 12:22, 28 January 2013

Five iConference2013 Talk Abstracts

Introducing iDigBio and the Augmenting OCR Working Group

Deborah Paul: Abstract

Digitization of biocollections -- a grand challenge in scope, scale, and significance

Amanda Neill: Abstract

The Apiary Project -- a workflow for text extraction and parsing for herbarium specimens

Jason Best: Abstract

Symbiota -- Creating an OCR and NLP enabled user interface and workflow to efficiently digitize 2.3 million lichen and bryophyte specimens

Edward Gilbert: Abstract

HERBIS/LABELX -- Machine Learning Approach to Parsing OCR Text

Bryan Heidorn: A central goal of the digitization of museum specimen labels is to parse the content of the labels into standard fields such as those identified in the Darwin Core (DwC). Museum specimen labels number in the billions and range from hundreds of years old to recently created. Subsets of labels share some minimal layout consistency, but overall the layout varies widely. The text generated via OCR from these labels tends to have a high rate of errors. This combination makes it extremely difficult for programmers to write rule systems or other matching algorithms to classify the sub-elements of the labels into the DwC fields. Statistical methods such as Naive Bayes, Hidden Markov Models, and n-grams can be combined with human supervision and authority files to successfully parse OCR output into proper fields and correct some OCR errors based on context.
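The statistical approach named in the abstract can be illustrated with a small sketch. This is not the HERBIS/LABELX implementation; it is a minimal Naive Bayes classifier over character trigrams, written under the assumption that each label line is classified independently and that a curator has supplied a handful of hand-labeled lines. The class name `NgramNaiveBayes`, the helper `char_ngrams`, and the training lines are all hypothetical; only the field names (`scientificName`, `recordedBy`, `eventDate`, `locality`) are Darwin Core terms.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Return overlapping character n-grams of a lower-cased, space-padded string."""
    padded = " " * (n - 1) + text.lower() + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramNaiveBayes:
    """Hypothetical multinomial Naive Bayes over character n-grams."""

    def __init__(self, n=3):
        self.n = n
        self.field_counts = Counter()            # labeled lines seen per field
        self.gram_counts = defaultdict(Counter)  # n-gram counts per field
        self.vocab = set()                       # all n-grams seen in training

    def train(self, examples):
        """examples: iterable of (label_line, dwc_field) pairs labeled by a human."""
        for text, field in examples:
            self.field_counts[field] += 1
            for gram in char_ngrams(text, self.n):
                self.gram_counts[field][gram] += 1
                self.vocab.add(gram)

    def classify(self, text):
        """Return the field with the highest log-posterior for this line."""
        total = sum(self.field_counts.values())
        vocab_size = len(self.vocab)
        best_field, best_score = None, float("-inf")
        for field, count in self.field_counts.items():
            score = math.log(count / total)  # log prior
            denom = sum(self.gram_counts[field].values()) + vocab_size
            for gram in char_ngrams(text, self.n):
                # Add-one smoothing keeps unseen (e.g. OCR-damaged) n-grams
                # from driving the probability to zero.
                score += math.log((self.gram_counts[field][gram] + 1) / denom)
            if score > best_score:
                best_field, best_score = field, score
        return best_field


# Invented training lines standing in for human-supervised examples.
TRAINING = [
    ("Quercus alba L.", "scientificName"),
    ("Acer rubrum L.", "scientificName"),
    ("Carex stricta Lam.", "scientificName"),
    ("Coll. J. Smith", "recordedBy"),
    ("Collected by A. Jones", "recordedBy"),
    ("Leg. M. Brown", "recordedBy"),
    ("12 May 1923", "eventDate"),
    ("3 June 1887", "eventDate"),
    ("21 Aug 1954", "eventDate"),
    ("Tarrant County, Texas", "locality"),
    ("near Fort Worth, Texas", "locality"),
    ("Pima County, Arizona", "locality"),
]

model = NgramNaiveBayes(n=3)
model.train(TRAINING)

# A line with simulated OCR damage ("l" read as "1", "m" read as "rn").
print(model.classify("Col1. J. Srnith"))
print(model.classify("7 May 1901"))
```

Character n-grams are used here rather than whole words because a single misread character leaves most of a word's trigrams intact, so the classifier can still find evidence for the right field. A fuller system along the lines the abstract describes would also model the order of fields on a label (for example with a Hidden Markov Model) and check candidate values against authority files.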

Linking Data -- Biodiversity Heritage Library -- supporting knowledge discovery from digitized content

John Mignault: Abstract

Back to

iConference 2013 iDigBio AOCR WG Wiki

@@ Line 1: / Line 1: @@
 ==Five iConference2013 Talk Abstracts==
-===HERBIS/LABELX -- Machine Learning Approach to Parsing OCR Text===
-:::;Bryan Heidorn: A central goal of the digitization of museum specimen labels is to parse the content of the labels into standard fields such as those identified in the Darwin Core (DwC). Museum specimen labels number in the billions and may be hundreds of years old or created recently. They have some minimal layout consistency within subsets of labels but overall have a very high layout variability. The text generated via OCR from these labels tends to have a high rate of errors. This combination makes it extremely difficult for programmers to write rule systems or other matching algorithms to classify the sub-elements of the labels into the DwC fields. Statistical methods such as Naive Bayes, Hidden Markov Models and N-Gramming can be combined with human supervision and authority files to successfully parse OCR output into proper fields and correct some OCR errors based on context.
+=== Introducing iDigBio and the Augmenting OCR Working Group===
+:::;Deborah Paul: Abstract
+===Digitization of biocollections -- a grand challenge in scope, scale, and significance===
+:::; Amanda Neill: Abstract
+===The Apiary Project -- a workflow for text extraction and parsing for herbarium specimens===
+:::;Jason Best: Abstract
+===Symbiota -- Creating an OCR and NLP enabled user interface and workflow to efficiently digitize 2.3 million lichen and bryophyte specimens===
+:::;Edward Gilbert: Abstract
+===HERBIS/LABELX -- Machine Learning Approach to Parsing OCR Text===
+:::;Bryan Heidorn: A central goal of the digitization of museum specimen labels is to parse the content of the labels into standard fields such as those identified in the Darwin Core (DwC). Museum specimen labels number in the billions and may be hundreds of years old or created recently. They have some minimal layout consistency within subsets of labels but overall have a very high layout variability. The text generated via OCR from these labels tends to have a high rate of errors. This combination makes it extremely difficult for programmers to write rule systems or other matching algorithms to classify the sub-elements of the labels into the DwC fields. Statistical methods such as Naive Bayes, Hidden Markov Models and N-Gramming can be combined with human supervision and authority files to successfully parse OCR output into proper fields and correct some OCR errors based on context.
+===Linking Data -- Biodiversity Heritage Library -- supporting knowledge discovery from digitized content===
+:::;John Mignault: Abstract
 === Back to ===
 [[iConference 2013 iDigBio AOCR WG Wiki]]
