:::;Bryan Heidorn: A central goal of the digitization of museum specimen labels is to parse the content of the labels into standard fields such as those identified in the Darwin Core (DwC). Museum specimen labels number in the billions and may be hundreds of years old or created recently. They have some minimal layout consistency within subsets of labels but overall have a very high layout variability. The text generated via OCR from these labels tends to have a high rate of errors. This combination makes it extremely difficult for programmers to write rule systems or other matching algorithms to classify the sub-elements of the labels into the DwC fields. Statistical methods such as Naive Bayes, Hidden Markov Models and N-Gramming can be combined with human supervision and authority files to successfully parse OCR output into proper fields and correct some OCR errors based on context.
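As a rough illustration of the statistical approach mentioned in the abstract above, the following sketch trains a Naive Bayes classifier on character trigrams so that a text segment can still be assigned to a field when OCR has corrupted a character or two. It is not the speaker's implementation: the class name, the helper function, and the tiny training set are all invented for this example, and only the field names (<code>scientificName</code>, <code>recordedBy</code>, <code>eventDate</code>, <code>locality</code>) are taken from Darwin Core.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Return overlapping character n-grams of the lowercased, space-padded text."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NaiveBayesFieldClassifier:
    """Hypothetical multinomial Naive Bayes over character n-grams."""

    def __init__(self, n=3):
        self.n = n
        self.ngram_counts = defaultdict(Counter)  # field -> n-gram frequencies
        self.field_counts = Counter()             # field -> number of training segments
        self.vocabulary = set()                   # all n-grams seen in training

    def train(self, labeled_segments):
        """labeled_segments: iterable of (text, field) pairs supplied by a human."""
        for text, field in labeled_segments:
            grams = char_ngrams(text, self.n)
            self.ngram_counts[field].update(grams)
            self.field_counts[field] += 1
            self.vocabulary.update(grams)

    def classify(self, text):
        """Return the field with the highest log-posterior for this segment."""
        total = sum(self.field_counts.values())
        vocab_size = len(self.vocabulary)
        best_field, best_score = None, -math.inf
        for field, count in self.field_counts.items():
            score = math.log(count / total)  # log prior
            field_total = sum(self.ngram_counts[field].values())
            for gram in char_ngrams(text, self.n):
                # Add-one smoothing so unseen (e.g. OCR-damaged) n-grams
                # lower the score without zeroing it out.
                seen = self.ngram_counts[field][gram]
                score += math.log((seen + 1) / (field_total + vocab_size))
            if score > best_score:
                best_field, best_score = field, score
        return best_field


# Invented training data, standing in for human-labeled label segments.
TRAINING = [
    ("Quercus alba L.", "scientificName"),
    ("Acer rubrum L.", "scientificName"),
    ("Pinus strobus L.", "scientificName"),
    ("Coll. J. Smith", "recordedBy"),
    ("Coll. A. B. Jones", "recordedBy"),
    ("leg. M. Brown", "recordedBy"),
    ("12 June 1923", "eventDate"),
    ("3 May 1898", "eventDate"),
    ("21 August 1954", "eventDate"),
    ("Near Urbana, Champaign Co., Illinois", "locality"),
    ("Along creek, Cook Co., Illinois", "locality"),
    ("Roadside, Dane Co., Wisconsin", "locality"),
]

classifier = NaiveBayesFieldClassifier()
classifier.train(TRAINING)

# "Co11." and "a1ba" imitate typical OCR confusions of "l" with "1".
print(classifier.classify("Co11. J. Smith"))
print(classifier.classify("Quercus a1ba L."))
```

Character n-grams are used here rather than whole words because a single misread character changes a word token completely but leaves most of its trigrams intact. A realistic system would need far more training data, and the abstract also points to Hidden Markov Models and authority files, which this sketch does not attempt.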
=== [https://www.idigbio.org/sites/default/files/workshop-presentations/aocr-hackathon/AOCRandBHL.ppt Linking Data -- Biodiversity Heritage Library -- supporting knowledge discovery from digitized content] ===
:::;John Mignault: The Biodiversity Heritage Library (BHL), a global consortium of natural history and botanical libraries, is an ongoing project digitizing the legacy literature in its members' collections for open access. In partnership with the Internet Archive and through its portal, BHL has made 40 million pages available for open access by the global research community. The rapid growth of the text corpus has led to challenges in identifying and extracting semantic information from it, many of them similar to the challenges faced in OCR workflow and extraction from specimen labels. We will discuss the possible improvements in knowledge extraction that could result from improvements in OCR workflow and accuracy, as well as the implications for more intelligent data integration in biodiversity informatics.
=== Back to ===
[[iConference 2013 iDigBio AOCR WG Wiki]]