OCR Tips

From iDigBio

Latest revision as of 17:33, 3 January 2014

FineReader Tips

What works best:

  • Recommended Image Resolution: 300 dpi for typical texts (printed in fonts of size 10pt or larger), 400–600 dpi for texts printed in smaller fonts (9pt or smaller). For best OCR results, the vertical and horizontal resolutions must be the same. See the User's Guide (http://finereader.abbyy.com/guide/) for additional information.
  • Full-size color JPEG images of herbarium specimens are about 7-15 MB in size and take about 2 minutes each to process.
  • Full-size grayscale JPEG images of herbarium specimens are about 1 MB in size and take about 1 minute each to process.
  • Cropped JPEG images containing only the primary plant collection label are about 300-600 KB in size and take 6-10 seconds to process each.
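The per-image timings above translate directly into batch estimates. The following is a minimal sketch of that arithmetic; the dictionary keys and the function name are illustrative, and the timings are simply the figures reported above, so actual throughput will depend on hardware and settings.

```python
# Per-image processing times in seconds as (low, high), taken from the list above.
SECONDS_PER_IMAGE = {
    "fullsize_color": (120, 120),      # about 2 minutes each
    "fullsize_grayscale": (60, 60),    # about 1 minute each
    "cropped_label": (6, 10),          # 6-10 seconds each
}


def estimate_batch_hours(image_kind, n_images):
    """Return the (low, high) estimated processing time in hours for a batch."""
    low, high = SECONDS_PER_IMAGE[image_kind]
    return (low * n_images / 3600, high * n_images / 3600)


if __name__ == "__main__":
    for kind in SECONDS_PER_IMAGE:
        low, high = estimate_batch_hours(kind, 10000)
        print(f"{kind}: {low:.1f} to {high:.1f} hours for 10,000 images")
```

For 10,000 images this works out to roughly 333 hours for full-size color images versus about 17 to 28 hours for cropped labels, which is the main argument for cropping before OCR.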

What to look out for:

  • Setting the resolution above 600 dpi increases recognition time without substantially improving recognition results. Setting an extremely low resolution (less than 150 dpi) adversely affects OCR quality. See the User's Guide (http://finereader.abbyy.com/guide/) for additional information.
  • Pattern training: If using this tool, be sure to train it on an image that has the same resolution as the other images you wish to OCR. This tool can be useful when running the software on many labels with the same format/fonts.
  • Hot Folder: Using the Hot Folder allows for running the OCR software on batches of images. However, it does not scan barcodes in an image, only human-readable text. When running the software on individual images (i.e. not using the Hot Folder), one can select to scan the barcode as well as detect human-readable text.
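The resolution guidance in this section can be turned into a simple pre-flight check. This is only a sketch: the function name and warning codes are made up for illustration, and treating any font under 10pt as "small" is an assumption, since the guidance above only names 9pt and 10pt.

```python
def assess_scan_resolution(x_dpi, y_dpi, font_size_pt=10.0):
    """Return a list of warning codes for a scan; an empty list means no issues."""
    warnings = []
    # Vertical and horizontal resolutions should be the same.
    if x_dpi != y_dpi:
        warnings.append("unequal_resolution")
    dpi = min(x_dpi, y_dpi)
    # Assumption: fonts under 10pt are treated as "small" and need at least 400 dpi.
    recommended_min = 300 if font_size_pt >= 10 else 400
    if dpi < 150:
        warnings.append("too_low")            # OCR quality suffers
    elif dpi < recommended_min:
        warnings.append("below_recommended")
    elif dpi > 600:
        warnings.append("slow_no_gain")       # longer recognition time, little benefit
    return warnings
```

For example, `assess_scan_resolution(300, 300, font_size_pt=8)` flags the scan as below the recommended resolution for small print.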

Recognition Server Tips

What works best:

EH: Start with fewer languages selected, since each language adds to the processing time (specimens could be sorted geographically prior to OCR). We are currently processing specimens from SW Asia and the Middle East, with a large number from Turkey, so we run ABBYY with Turkish and English selected.
EH: We select high quality rather than speed.
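Sorting specimens geographically before OCR, as suggested above, could look something like the sketch below. Everything here is hypothetical: the record layout, the country-to-language mapping (only the Turkey entry comes from the note above), and the English-only default. It does not call any ABBYY API; it only groups image paths into batches that share a language selection.

```python
from collections import defaultdict

# Hypothetical mapping from collecting country to the OCR languages to select.
# Only the Turkey entry reflects the note above; extend as needed.
LANGUAGES_BY_COUNTRY = {
    "Turkey": ("Turkish", "English"),
}
DEFAULT_LANGUAGES = ("English",)  # assumption: fall back to English only


def group_by_languages(records):
    """Group (image_path, country) records by the language set to select."""
    batches = defaultdict(list)
    for image_path, country in records:
        languages = LANGUAGES_BY_COUNTRY.get(country, DEFAULT_LANGUAGES)
        batches[languages].append(image_path)
    return dict(batches)
```

Each resulting batch can then be run with only its own languages selected, which keeps the per-image language count low.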

PL: OCR quality can be enhanced when a large image is cropped, which also reduces page count.
PL: Images can be ingested from a shared folder, a scan station, FTP/FTPS, or the API. Hot folder ingestion can be further controlled by including an optional XML ticket. XML tickets control workflow and output, and allow metadata to be ingested along with the image to be processed.

What to look out for:

EH: We find that running the whole image increases the page count: each image ends up as 4-6 pages. This is not necessarily a problem, but it is good to be aware of if page count is an issue.

Tesseract Tips

What works best:

Resolution: an x-height (pixel height of a lowercase letter) between 20 and 40 pixels is ideal.
Switching to grayscale, increasing contrast, and other image treatments can sometimes improve output.

What to look out for:

Resolution: an x-height below 8-12 pixels will produce very poor OCR results.
Using a black background for packet labels (e.g. lichens, bryophytes) will create a black border that can significantly reduce OCR results.
Form labels can interfere with OCR output.
Faded labels or images with poor lighting can be problematic.
Old fonts can be problematic. However, it is possible to train Tesseract for new fonts.
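The x-height guidance above can be related to scan settings with a rough calculation. This is a sketch under stated assumptions: the x-height is taken to be about half the nominal font size (the ratio varies by typeface), and the labels for the ranges between the thresholds quoted above are illustrative.

```python
POINTS_PER_INCH = 72
X_HEIGHT_RATIO = 0.5  # assumption: x-height is about half the font size


def estimate_x_height_px(font_size_pt, dpi, x_height_ratio=X_HEIGHT_RATIO):
    """Estimate the pixel height of a lowercase letter for a font size and scan dpi."""
    return font_size_pt * dpi * x_height_ratio / POINTS_PER_INCH


def classify_x_height(x_height_px):
    """Classify an x-height against the 8-12 px and 20-40 px figures above."""
    if x_height_px < 12:
        return "likely poor"    # below the 8-12 px range
    if x_height_px < 20:
        return "below ideal"    # assumption: usable but not ideal
    if x_height_px <= 40:
        return "ideal"
    return "above ideal"        # assumption: larger than needed
```

Under these assumptions, 12pt text scanned at 300 dpi has an x-height of about 25 pixels, which falls in the ideal range, while 8pt text at 150 dpi comes out at about 8 pixels.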

Fixing errors:

Tesseract makes characteristic errors. Some of these, such as "\/\/" or "\X/" substituted for "W", can be globally replaced, as it is highly unlikely that they would occur on their own on a label. Others, such as "O" substituted for "0", "1" or "!" substituted for "l", or "Z" substituted for "2" (or vice versa), can be replaced in a context-dependent manner in dates, latitudes and longitudes, etc. For instance, a string containing multiple errors such as "0ct. !Z, ZOlZ" can be programmatically located with a regular expression and changed to "Oct. 12, 2012" or even "12-October-2012" so that it can be entered into a database.
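The two kinds of correction described above might be sketched as follows. This is one possible approach, not an established tool: the function names, the character tables, and the date pattern (which only handles the "Mon. DD, YYYY" layout) are all assumptions chosen for illustration.

```python
import re

# Sequences that are safe to replace everywhere.
GLOBAL_FIXES = {r"\/\/": "W", r"\X/": "W"}

# Characters commonly misread where a digit is expected.
DIGIT_FIXES = str.maketrans({"O": "0", "o": "0", "l": "1", "I": "1", "!": "1", "Z": "2"})
# Digits commonly misread where a letter is expected (used for month names).
MONTH_FIXES = str.maketrans({"0": "O", "1": "l"})

MONTHS = {"Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"}

# Month-like token, optional period, day, comma, four-character year.
DATE_RE = re.compile(
    r"\b([A-Za-z0-9]{3})(\.?)\s+([0-9OolI!Z]{1,2}),\s*([0-9OolI!Z]{4})\b"
)


def fix_global(text):
    """Replace sequences that are very unlikely to be genuine label text."""
    for bad, good in GLOBAL_FIXES.items():
        text = text.replace(bad, good)
    return text


def fix_dates(text):
    """Repair digit/letter confusions, but only inside date-like strings."""
    def repair(match):
        month = match.group(1).translate(MONTH_FIXES).capitalize()
        day = match.group(3).translate(DIGIT_FIXES)
        year = match.group(4).translate(DIGIT_FIXES)
        # Leave the text alone unless it really looks like a date.
        if month not in MONTHS or not 1 <= int(day) <= 31:
            return match.group(0)
        return f"{month}{match.group(2)} {day}, {year}"
    return DATE_RE.sub(repair, text)


def clean_ocr(text):
    """Apply global fixes first, then context-dependent date fixes."""
    return fix_dates(fix_global(text))
```

Restricting the digit substitutions to strings that already match a date pattern is what keeps them context-dependent: an "O" or "Z" elsewhere on the label is left untouched. Similar patterns could be written for latitudes and longitudes.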

Misc notes:

Tesseract will often recognize vertical text.
Image input can be TIFF, JPEG, or GIF.

Omnipage features

Advantages of Omnipage

  • Very good OCR accuracy
  • Converted documents (e.g. PDF files) look like the original and use similar fonts.
  • Supports many output formats, including XML, PDF, HTML, and Microsoft Word.
  • Provides location (coordinates) of recognized words.
  • Batch processing with 4 cores.
  • Schedule large volumes of files for batch processing from folders.
  • Recognises over 120 languages.
  • Price from $149.00 (standard) to $499.00 (professional)

Disadvantages

  • Limited to images of 8600 lines.
  • The Omnipage SDK costs an estimated $6000.00.
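The 8600-line limit has a practical consequence for scan resolution. The sketch below assumes that "lines" means rows of pixels (image height), which the note above does not state explicitly; the function names are illustrative.

```python
OMNIPAGE_MAX_LINES = 8600  # from the limit above; assumed to mean pixel rows


def fits_line_limit(height_px, max_lines=OMNIPAGE_MAX_LINES):
    """Return True if an image of this pixel height is within the limit."""
    return height_px <= max_lines


def max_dpi_for_height(height_inches, max_lines=OMNIPAGE_MAX_LINES):
    """Return the highest scan resolution that keeps a sheet within the limit."""
    return max_lines / height_inches
```

Under that assumption, a hypothetical sheet 16.5 inches tall would have to be scanned at about 521 dpi or less (8600 / 16.5), or cropped before processing.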

Source of page contents

Notes on these pages are compiled from the cumulative experiences of the iDigBio Augmenting OCR Working Group and the natural history collections community members contributing their collective knowledge for the benefit of all.