Technical Issues: Difference between revisions

From iDigBio


----
== General Notes and Observations ==
*Tesseract is sensitive to non-text content in an image: graphics, or even a simple black border, can interfere with OCR output. For example, Tesseract produces no output at all for the following images:
**http://storage.idigbio.org/vsc/bryophytes/VSC-L02/VSC-L02073.jpg
**http://storage.idigbio.org/vsc/bryophytes/VSC-L00/VSC-L00118.jpg

Revision as of 11:27, 8 August 2012

Resolution

The image resolution needed for good OCR results is a tricky subject. The commonly heard rule is that a resolution of 300 dpi (dots per inch) is necessary. However, this rule generally applies to images obtained from a scanner, and dpi values can be misleading when a camera is used to image text. Dpi is meaningful only if the document is reproduced at a 1:1 ratio; with a camera, the effective ratio varies with camera placement and the distance between the camera and the document. Font size is another factor that can lead to poor OCR output. For instance, a 16pt font at 200 dpi will return better OCR results than an 8pt font at 300 dpi, because the larger font still yields taller characters in pixels. Therefore, a better measure of image resolution for OCR purposes is the x-height of the text, that is, the height in pixels of a lowercase x within the document. According to the Tesseract documentation, an x-height of 20 pixels or more is preferred. See Tesseract's FAQ for more information concerning this issue.
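The relationship between font size, dpi, and x-height can be sketched with a little arithmetic. The following Python snippet is an illustration only, not part of Tesseract: the function names are hypothetical, and it assumes a 1:1 scan and an x-height of roughly half the font's point size (a typical but typeface-dependent ratio). The 20-pixel threshold is the figure quoted above.

```python
POINTS_PER_INCH = 72          # 1 pt = 1/72 inch
PREFERRED_X_HEIGHT_PX = 20    # threshold quoted from the Tesseract guidance above


def estimated_x_height_px(font_pt, dpi, x_height_ratio=0.5):
    """Estimate the pixel height of a lowercase x.

    x_height_ratio is an ASSUMPTION: the fraction of the font size taken up
    by the x-height. It varies by typeface; 0.5 is only a rough default.
    """
    return font_pt / POINTS_PER_INCH * dpi * x_height_ratio


def meets_preferred_x_height(font_pt, dpi, x_height_ratio=0.5):
    """Return True if the estimated x-height reaches the preferred threshold."""
    return estimated_x_height_px(font_pt, dpi, x_height_ratio) >= PREFERRED_X_HEIGHT_PX


def minimum_dpi(font_pt, x_height_ratio=0.5):
    """Smallest dpi at which the estimated x-height reaches the threshold."""
    return PREFERRED_X_HEIGHT_PX / (font_pt / POINTS_PER_INCH * x_height_ratio)


if __name__ == "__main__":
    for font_pt, dpi in [(16, 200), (8, 300)]:
        height = estimated_x_height_px(font_pt, dpi)
        print(f"{font_pt}pt at {dpi} dpi: about {height:.1f} px x-height")
```

Under these assumptions, the 16pt font at 200 dpi comes out above 20 pixels while the 8pt font at 300 dpi falls short, which matches the example in the paragraph. For camera images, where the effective dpi is unknown, measuring the x-height directly in the image is more reliable than this estimate.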


General Notes and Observations