Label Recognition in Herbarium Specimens by using Database-Queries

Title: Label Recognition in Herbarium Specimens by using Database-Queries
Publication Type: Conference Paper
Year of Publication: 2011
Authors: Gehrke, M., Steinke KH, and Dzido R.
Conference Name: ICDM 2011 IEEE International Conference on Data Mining
Date Published: 09/2011
Publisher: IEEE
Conference Location: Vancouver, Canada
Other Numbers: ISSN 1864-9734
Keywords: database, label recognition, OCR-texts, rotation-invariant, scale-invariant, SURF, template matching
Abstract: For hundreds of years, plant specimens have been collected in herbaria for scientific purposes. These plants are mounted on specimen sheets and labeled with essential data such as the name of the collector, the date, and the place where the plant was found. To make them available to a wider public, the specimen sheets are digitized for online use. The purpose of our project, in cooperation with the Botanical Garden in Berlin (Germany), is the development of a software system for the analysis of these high-resolution images. There are different approaches, such as template matching or SURF, for detecting stable objects like color charts, rulers, barcodes, or even labels (fig. 1) on specimens. This paper describes a new method for the automatic scale- and rotation-invariant recognition of predefined label types by using OCR-engine-generated texts contained in a database.
Refereed Designation: Refereed

Comments

Submitted by dpaul on

Very exciting paper!

With this new algorithm, those capturing images of herbarium sheets will be able to group the sheets into sets by predefined label types, so that sheets from a given collector or from a given collecting event group together. The new algorithm (5 seconds per sheet) is also more than 300 times faster than the old template-matching algorithm (1800+ seconds per sheet). Other research (de la Cerda 2010; Elspeth Haston, private communication, 2012) tells us that data entry staff are happier and more productive when entering ordered information. This makes sense given the repetitiveness of the data-entry task: if all the images in the stack to be digitized are from the same collector, then many will be from the same collecting event, and all that will change is the taxonomic identification. Many herbaria (and other specimen collections) spend a lot of time organizing their collections before digitization (pre-digitization curation). For very large herbaria planning to use more industrial processes to capture herbarium sheet images, technology like this could eliminate, or greatly reduce, the need to pre-sort the collection.
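
To make the grouping idea concrete, here is a minimal Python sketch of sorting sheets into stacks by label type from their OCR text. It is not the method from the paper: the keyword table `LABEL_TYPES`, the names `type_a` and `type_b`, the `min_score` threshold, and the simple keyword-overlap score are all assumptions made up for illustration.

```python
import re
from collections import defaultdict

# Hypothetical label-type definitions: each type is described by a few
# keywords expected in the OCR text of that label (assumed, not from the paper).
LABEL_TYPES = {
    "type_a": {"flora", "brasiliensis", "leg"},
    "type_b": {"plantae", "mexicanae", "det"},
}


def tokenize(text):
    """Lower-case the OCR text and return its set of alphabetic tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def classify(ocr_text, label_types=LABEL_TYPES, min_score=0.5):
    """Return the best-matching label type, or None if nothing matches well.

    The score is the fraction of a type's keywords found in the OCR text,
    so a single misread word does not prevent a match.
    """
    tokens = tokenize(ocr_text)
    best_type, best_score = None, 0.0
    for name, keywords in label_types.items():
        score = len(tokens & keywords) / len(keywords)
        if score > best_score:
            best_type, best_score = name, score
    return best_type if best_score >= min_score else None


def group_sheets(sheets):
    """Group sheet IDs by recognized label type (None = unrecognized)."""
    groups = defaultdict(list)
    for sheet_id, ocr_text in sheets.items():
        groups[classify(ocr_text)].append(sheet_id)
    return dict(groups)


if __name__ == "__main__":
    sheets = {
        "s1": "Flora Brasiliensis leg. X",
        "s2": "PLANTAE mexicanae det. Y",
        "s3": "Flora brasiliensls leg. Z",  # OCR misread, still 2 of 3 keywords
        "s4": "unreadable",
    }
    for label_type, ids in group_sheets(sheets).items():
        print(label_type, ids)
```

Sheets that end up under `None` would still need manual sorting; the rest could be handed to data-entry staff as ordered stacks, one per label type.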