OCR / NLP Workflows

From iDigBio



Revision as of 17:39, 18 June 2014

OCR / NLP Workflow & Protocol Documents

The SALIX Method

The SALIX Method: A semi-automated workflow for herbarium specimen digitization. Barber, A.C., Lafferty, D., & Landrum, L.R. Taxon, Volume 62, Number 3, 17 June 2013, pp. 581-590. DOI: http://dx.doi.org/10.12705/623.16

Abstract. Supported by a United States American Recovery and Reinvestment Act grant, we have developed a workflow, “the SALIX Method,” to image, database, and provide web access to ca. 60,000 Latin American plant specimens housed at the Arizona State University Herbarium. The SALIX Method incorporates optical character recognition using ABBYY FineReader and uses other proprietary software for word processing (Microsoft Word) and image management (Adobe Lightroom). We developed the other applications ourselves: SALIX for text parsing, and BarcodeRenamer (BCR) for renaming image files to match their barcodes. We use our Symbiota data portal (SEINet) to provide web access to collections data and images. Data entry was found to be about as fast to considerably faster using the SALIX Method than by keystroke entry directly into SEINet. Speed is dependent on label quality and length as well as user proficiency.


Arizona State University Herbarium Digitization Project

Exemplar Workflow: Using the Output from OCR in SALIX


OCR Workflow at the New York Botanical Garden

Image Processing for OCR: Our main goal during this step is to reduce the OCR processing time without reducing the quality of the OCR output. There are two ways to accomplish this: 1) convert the image to grayscale (which reduces the file size but not the pixel dimensions), or 2) decrease the pixel dimensions of the full-color image. Both methods seem to produce usable OCR, with processing times of ~1 minute per full-sized (1 MB) image. Finer-tuned comparisons may show that one method is preferable to the other, but maintaining an x-height of 20 pixels seems to be the most important variable for good OCR. For more information on optimizing OCR, see Finereader Tips.
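As a rough illustration of the second method, the arithmetic for picking new pixel dimensions can be sketched in Python. This is only a sketch under stated assumptions: the function name and parameters are hypothetical, the x-height of the label text is assumed to have been measured in pixels beforehand, and the 20-pixel target is the value mentioned above. The actual resampling would be done in whatever image software the workflow already uses.

```python
from typing import Tuple


def downscaled_size(width: int, height: int, x_height_px: float,
                    target_x_height: float = 20.0) -> Tuple[int, int]:
    """Return pixel dimensions that shrink an image until its label text
    has an x-height of about target_x_height pixels.

    Hypothetical helper: x_height_px is the measured x-height of the
    label text in the original image. The image is never enlarged.
    """
    if x_height_px <= 0:
        raise ValueError("x_height_px must be positive")
    # A scale of 1.0 means the text is already at or below the target,
    # so the image is left at its original size.
    scale = min(1.0, target_x_height / x_height_px)
    return max(1, round(width * scale)), max(1, round(height * scale))


if __name__ == "__main__":
    # Text with a 40-pixel x-height can be halved in each dimension.
    print(downscaled_size(4000, 6000, 40))
```

Capping the scale at 1.0 is a deliberate choice in this sketch: enlarging an image adds no information for the OCR engine and only increases processing time.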

Using OCR in Specimen Cataloging: Although parsing algorithms are still imperfect, OCR output already offers considerable advantages: information extracted from it can be used to sort yet-to-be-cataloged specimens (by label type, for example). For some ideas on how to do this, refer to the following PowerPoint presentations: OCR implementation in The Caribbean Plants Digitization Project (https://www.idigbio.org/sites/default/files/workshop-presentations/aocr-wgw/gottschalk_gainesville.pptx) and Tri-Trophic Digitization: Putting the OCR in Workflow (https://www.idigbio.org/sites/default/files/workshop-presentations/aocr-wgw/Watson-Tri-Trophic-Digitization-OCR.pptx).
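One simple way to sort by label type is to look for a characteristic phrase in each specimen's OCR text and group the specimens accordingly. The Python sketch below is an assumption-laden illustration, not the method used in the presentations: the header phrases, barcodes, and function names are all hypothetical, and a real project would substitute phrases drawn from its own labels.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical header phrases that identify a label type; replace these
# with phrases that actually appear on the labels being cataloged.
LABEL_PHRASES = ["flora of cuba", "plants of puerto rico"]


def label_type(ocr_text: str) -> str:
    """Return the first known phrase found in the OCR text, or 'unsorted'."""
    text = ocr_text.lower()
    for phrase in LABEL_PHRASES:
        if phrase in text:
            return phrase
    return "unsorted"


def group_by_label_type(ocr_by_barcode: Dict[str, str]) -> Dict[str, List[str]]:
    """Group specimen barcodes by the label type detected in their OCR text."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for barcode, text in sorted(ocr_by_barcode.items()):
        groups[label_type(text)].append(barcode)
    return dict(groups)


if __name__ == "__main__":
    # Made-up barcodes and OCR snippets, for illustration only.
    records = {
        "B001": "FLORA OF CUBA\nColl. J. Smith",
        "B002": "Plants of Puerto Rico",
        "B003": "handwritten, mostly illegible",
    }
    for kind, barcodes in group_by_label_type(records).items():
        print(kind, barcodes)
```

Specimens whose OCR text matches no phrase fall into an "unsorted" group, so imperfect OCR does not cause records to be dropped; they are simply left for manual handling.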


OCR Workflow at the Royal Botanic Garden Edinburgh

Draft OCR workflow for RBGE

OCR Workflow in ScioTR

Workflow in ScioTR

Back to the aOCR Wiki