OCR / NLP Workflows: Difference between revisions

(added NYBG workflow)




'''Image Processing for OCR:'''  Our main goal during this step is to reduce OCR processing time without reducing the quality of the OCR output.  There are two ways to accomplish this: 1) convert the image to grayscale (reducing the file size but not the pixel dimensions) or 2) decrease the pixel dimensions of the full-color image.  Both methods seem to produce usable OCR, with processing times of ~1 minute per full-sized (1 MB) image.  Finer-tuned comparisons may indicate that one of these methods is preferable to the other, although maintaining an x-height of 20 pixels seems to be the most important variable for good OCR.  For more information on optimizing OCR, see [https://www.idigbio.org/wiki/index.php/OCR_Tips#FineReader_tips ABBYY FineReader Tips].
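
As a rough illustration of option 2, the sketch below computes how far an image could be shrunk while keeping the x-height at 20 pixels.  It is not taken from any workflow described here: the function name <code>downscale_dimensions</code> and its parameters are hypothetical, the x-height of the original image is assumed to have been measured by hand, and treating 20 pixels as a lower bound is our reading of the guideline above.  The actual resizing would be done in whatever image tool the project uses.

```python
import math

# Guideline from the text: keep the x-height at about 20 pixels.
TARGET_X_HEIGHT_PX = 20


def downscale_dimensions(width, height, measured_x_height,
                         target_x_height=TARGET_X_HEIGHT_PX):
    """Return (new_width, new_height) for the smallest image that still
    keeps the label text x-height at or above target_x_height.

    measured_x_height is the x-height (in pixels) of the label text in
    the original image; it is assumed to be measured manually.
    """
    if measured_x_height <= 0:
        raise ValueError("measured_x_height must be positive")

    # Never upscale: if the text is already at or below the target,
    # keep the original dimensions.
    scale = min(1.0, target_x_height / measured_x_height)

    # Round up so rounding never pushes the x-height below the target.
    return math.ceil(width * scale), math.ceil(height * scale)


if __name__ == "__main__":
    # A 4000 x 6000 image whose label text has a 40 px x-height
    # can be halved in each dimension.
    print(downscale_dimensions(4000, 6000, 40))
```

Halving each dimension cuts the pixel count to a quarter, which is where the reduction in OCR processing time would be expected to come from; whether that beats the grayscale route would still need the finer-tuned comparison mentioned above.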


'''Using OCR in Specimen Cataloging:'''  Though perfect parsing algorithms are still being developed, considerable advantages can be gained by using information extracted from the OCR to sort yet-to-be-cataloged specimens (by label type, for example).  For some ideas on how to do this, refer to the following PowerPoint: [https://www.idigbio.org/sites/default/files/workshop-presentations/aocr-wgw/gottschalk_gainesville.pptx OCR implementation in The Caribbean Plants Digitization Project].
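
One simple way to sort by label type is to look for characteristic phrases in the raw OCR text.  The sketch below is only an illustration of that idea and is not the method used in the linked presentation: the keyword table, the label type names, and the function names are all hypothetical and would need to be replaced with phrases that actually appear on a given collection's labels.

```python
from collections import defaultdict

# Hypothetical label types and the phrases that identify them.
# Replace these with phrases found on your own labels.
LABEL_KEYWORDS = {
    "flora_of": ["flora of"],
    "plants_of": ["plants of"],
}


def classify_label(ocr_text, keywords=LABEL_KEYWORDS):
    """Return the first label type whose phrase occurs in the OCR text,
    or "unsorted" if none match."""
    # Lowercase and collapse whitespace so line breaks and extra
    # spaces in the OCR output do not prevent a match.
    normalized = " ".join(ocr_text.lower().split())
    for label_type, phrases in keywords.items():
        if any(phrase in normalized for phrase in phrases):
            return label_type
    return "unsorted"


def sort_by_label_type(records, keywords=LABEL_KEYWORDS):
    """Group specimen identifiers by label type.

    records maps a specimen identifier to its OCR text.
    """
    groups = defaultdict(list)
    for specimen_id, ocr_text in records.items():
        groups[classify_label(ocr_text, keywords)].append(specimen_id)
    return dict(groups)


if __name__ == "__main__":
    sample = {
        "sp1": "FLORA OF\nJAMAICA",
        "sp2": "Plants  of Cuba",
        "sp3": "(illegible)",
    }
    print(sort_by_label_type(sample))
```

Specimens grouped this way can then be cataloged in batches that share collector, locality, or layout, even when the OCR is too noisy for full parsing.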