Improvement of Omnipage18's Efficiency

Title: Improvement of Omnipage18's Efficiency
Publication Type: Conference Paper
Year of Publication: 2012
Authors: Steinke, KH
Conference Name: IEEE International Conference on Computer Science and Automation Engineering (CSAE 2012)
Date Published: 05/2012
Publisher: IEEE
Conference Location: Zhangjiajie
Keywords: handwriting, localisation, Omnipage18, text region
Abstract: In many Botanical Museums in the world exist millions of dried plants on paper sheets. They were collected in the last two hundred years. The collectors left annotations about the plants on the sheets. Some of them are written in handwriting others by a typewriter. There exist also comments on printed forms which are glued on the sheets. By an automatic OCR of the scanned herbarium objects the recognized texts can be put into a database. Commercial OCR-programs like Finereader or Omnipage are capable to recognize undisturbed printed texts in a correct way. Unfortunately we deal with historical material which often contains overwritten or crossed out words, the paper is yellowed, molded and has other artifacts. Moreover it seems to be extremely complicated to detect text regions in a complex environment with objects like roots, leaves, stamps, bar codes, yardsticks, color charts etc. The old writing leads to read errors which are not the biggest problem, because they can be compensated by tolerant database queries. Worse is that sometimes text regions cannot be localized. In this paper a method is presented which helps Omnipage to detect missed text and also can distinguish printed text from handwriting.

Comments

Submitted by dpaul on

OCR, is it Text or Handwriting?

Ever wonder how OCR might one day distinguish handwriting from printed text? In this paper, published in May 2012, Karl-Heinz Steinke illustrates his latest work using the OCR software Omnipage 18 and the Omnipage SDK (that's software development kit) to: 1) improve detection of otherwise missed text and 2) distinguish printed text from handwriting. Clear illustrations explain how the software finds text and handwriting and determines which is which. While the examples are of text and writing on herbarium sheets, the methods ought to apply to text and writing on other label types as well. Steinke also notes that the ability to distinguish between handwriting and printed text means that nonsense text produced by OCR of handwriting could be algorithmically removed from the OCR output, leaving behind only the printed text.
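
To make that last point concrete, here is a minimal Python sketch of the filtering idea. It is not Steinke's implementation and does not use the Omnipage SDK. The `Region` record, its `kind` label, and the sample strings are all hypothetical, and the sketch assumes that some earlier step has already labeled each detected region as printed or handwriting.

```python
from dataclasses import dataclass


@dataclass
class Region:
    """One detected text region (hypothetical structure, not an Omnipage type)."""
    text: str  # raw OCR output for the region
    kind: str  # assumed label from an earlier classifier: "printed" or "handwriting"


def keep_printed_text(regions):
    """Return the OCR text of printed regions only, dropping handwriting regions."""
    return [r.text for r in regions if r.kind == "printed"]


# Made-up sample data for illustration only.
regions = [
    Region("Herbarium label", "printed"),
    Region("xq~ lv#r", "handwriting"),  # nonsense OCR output from handwriting
    Region("Collected 1890", "printed"),
]

printed_only = keep_printed_text(regions)
print(printed_only)
```

In this sketch the handwriting regions are simply discarded; a real workflow might instead set them aside for manual transcription.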