OCR / NLP Workflows



=== OCR Workflow at the New York Botanical Garden ===
'''Image Processing for OCR:''' Our main goal during this step is to reduce OCR processing time without reducing the quality of the OCR output. There are two ways to accomplish this: 1) converting the image to grayscale (thereby reducing the file size, but not the pixel dimensions) or 2) decreasing the pixel dimensions of the full-color image. Both methods seem to produce usable OCR, with processing times of ~1 minute per full-sized (1 MB) image. Finer-tuned comparisons may show that one method is preferable to the other, although maintaining an x-height (the height of a lowercase "x" in the label text) of 20 pixels seems to be the most important variable for good OCR. For more information on optimizing OCR, see [https://www.idigbio.org/wiki/index.php/OCR_Tips#FineReader_tips ABBYY FineReader Tips].
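
The arithmetic behind option 2 can be sketched as follows. This is only an illustration, not the NYBG procedure: the function name <code>downscale_plan</code>, the example image size, and the measured x-height are all hypothetical, and the sketch assumes the 20-pixel x-height is treated as a floor that downscaling should not go below.

```python
MIN_X_HEIGHT_PX = 20  # x-height target mentioned in the text; treated here as a minimum (assumption)


def downscale_plan(width, height, x_height_px, min_x_height=MIN_X_HEIGHT_PX):
    """Return (scale, new_width, new_height) for the smallest image size
    that keeps the label text at roughly min_x_height pixels."""
    if x_height_px <= 0:
        raise ValueError("x_height_px must be positive")
    # Never upscale: if the text is already at or below the target, keep the original size.
    scale = min(1.0, min_x_height / x_height_px)
    new_width = max(1, round(width * scale))
    new_height = max(1, round(height * scale))
    return scale, new_width, new_height


# Hypothetical example: a 4000 x 6000 pixel image whose label text
# has a measured x-height of 50 pixels.
scale, new_width, new_height = downscale_plan(4000, 6000, 50)
print(scale, new_width, new_height)
```

With an imaging library such as Pillow, the resulting size could be passed to <code>Image.resize</code>, while option 1 corresponds to <code>Image.convert("L")</code>. Which of the two is faster for a given collection would still need to be measured.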


'''Using OCR in Specimen Cataloging:''' Though perfect parsing algorithms are still being developed, there are considerable advantages to using information extracted from the OCR to sort yet-to-be-cataloged specimens (by label type, for example). For some ideas on how to do this, refer to the following PowerPoint presentations: [https://www.idigbio.org/sites/default/files/workshop-presentations/aocr-wgw/gottschalk_gainesville.pptx OCR implementation in The Caribbean Plants Digitization Project] and [https://www.idigbio.org/sites/default/files/workshop-presentations/aocr-wgw/Watson-Tri-Trophic-Digitization-OCR.pptx Tri-Trophic Digitization: Putting the OCR in Workflow].
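
As a rough illustration of sorting by label type, the sketch below groups specimens by a header phrase found in their raw OCR text. It is not taken from the presentations above: the phrases, the label type names, and the specimen identifiers are all made up, and a real project would build its keyword list from its own labels.

```python
from collections import defaultdict

# Hypothetical label types and the header phrases assumed to identify them.
LABEL_TYPE_PHRASES = {
    "puerto_rico_flora": ["flora of puerto rico", "plants of puerto rico"],
    "brazil_flora": ["flora of brazil", "flora do brasil"],
}


def classify_label(ocr_text, phrases=LABEL_TYPE_PHRASES):
    """Return the first label type whose phrase occurs in the OCR text, else 'unsorted'."""
    text = ocr_text.lower()
    for label_type, candidates in phrases.items():
        if any(phrase in text for phrase in candidates):
            return label_type
    return "unsorted"


def group_by_label_type(records):
    """Group (specimen_id, ocr_text) pairs by their classified label type."""
    groups = defaultdict(list)
    for specimen_id, ocr_text in records:
        groups[classify_label(ocr_text)].append(specimen_id)
    return dict(groups)


# Made-up records standing in for uncataloged specimens and their OCR output.
records = [
    ("SPEC-001", "FLORA OF PUERTO RICO\nSierra de Luquillo"),
    ("SPEC-002", "Flora of Brazil\nEstado do Amazonas"),
    ("SPEC-003", "illegible handwritten label"),
    ("SPEC-004", "PLANTS OF PUERTO RICO\nMaricao"),
]
print(group_by_label_type(records))
```

Exact substring matching is used only to keep the sketch short. OCR errors in the header would send a specimen to the "unsorted" group, so some form of fuzzy matching is likely to be needed in practice.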
----
== OCR Workflow at the Royal Botanic Garden Edinburgh ==
'''[http://www.idigbio.org/sites/default/files/working-groups/aocr/OCRWorkflowRBGE.docx Draft OCR workflow for RBGE]'''
 
----
== OCR Workflow in ScioTR ==
'''[http://www.idigbio.org/sites/default/files/working-groups/aocr/Workflow4idigbio.doc Workflow in ScioTR]'''
 
----
== [https://www.idigbio.org/wiki/index.php/Augmenting_OCR Back to the aOCR Wiki] ==