OCR Tips: Difference between revisions

821 bytes added, 2 October 2012
- what to look out for (with examples).  


EH: We find that running OCR on the whole image increases the page count: each image ends up as 4-6 pages. This is not necessarily a problem, but it is good to be aware of if page count is an issue.<br><br>
 
== Tesseract Tips ==
 
Tesseract effective practices and hints.
 
<br>What works best:
 
Resolution: an x-height (the pixel height of a lowercase letter such as "x") of 20-40 pixels is ideal<br> Converting to grayscale, increasing contrast, and applying other image treatments can sometimes improve output
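
To make the resolution guideline concrete, here is a minimal sketch of the arithmetic for choosing a resize factor. The function name <code>resize_factor_for_x_height</code> and the default target of 30 pixels (the midpoint of the 20-40 range) are assumptions for illustration, not part of Tesseract. The x-height still has to be measured by hand or with an image tool, and enlarging a low-resolution image cannot restore detail that was never captured.

```python
def resize_factor_for_x_height(measured_px, target_px=30.0):
    """Return the factor by which to scale an image so that its
    x-height (pixel height of a lowercase letter) reaches target_px."""
    if measured_px <= 0:
        raise ValueError("measured x-height must be positive")
    if not 20 <= target_px <= 40:
        # The 20-40 pixel range comes from the guideline above.
        raise ValueError("target x-height should be within 20-40 pixels")
    return target_px / measured_px


# Example: lowercase letters on a label measure about 10 pixels tall.
factor = resize_factor_for_x_height(10)
print(f"Resize the image by a factor of {factor:.1f} before running OCR")
```

A factor above 1 means the image should be enlarged (or rescanned at a higher resolution); a factor below 1 means it can be reduced.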
 
<br>What to look out for:
 
Resolution: an x-height below 8-12 pixels will produce very poor OCR results<br> Using a black background for package labels (e.g. lichens, bryophytes) will create a black border that can significantly reduce OCR accuracy<br> Form labels can interfere with OCR output<br> Faded labels or images with poor lighting can be problematic<br> Old fonts can be problematic; however, it is possible to train Tesseract on new fonts
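
One possible way to deal with the black border mentioned above is to crop it before running OCR. The following is a rough sketch only: it assumes the image has already been loaded as a list of rows of grayscale values (0 = black, 255 = white), and the function name <code>trim_dark_border</code> and the threshold of 40 are hypothetical choices, not a Tesseract feature.

```python
def trim_dark_border(pixels, dark_threshold=40):
    """Crop the image to the bounding box of rows and columns that
    contain at least one pixel brighter than dark_threshold.

    pixels is a list of rows; each row is a list of grayscale values.
    """
    def is_dark(values):
        return all(v <= dark_threshold for v in values)

    # Rows that contain something other than dark background.
    rows = [i for i, row in enumerate(pixels) if not is_dark(row)]
    if not rows:
        return []  # the whole image is dark; nothing to keep
    top, bottom = rows[0], rows[-1]
    kept_rows = pixels[top:bottom + 1]

    # Columns that contain something other than dark background.
    width = len(kept_rows[0])
    cols = [j for j in range(width)
            if not is_dark([row[j] for row in kept_rows])]
    left, right = cols[0], cols[-1]
    return [row[left:right + 1] for row in kept_rows]


# Example: a tiny 4x5 "image" with a one-pixel black border on all sides.
image = [
    [0,   0,   0,   0, 0],
    [0, 200, 255, 210, 0],
    [0, 255,  30, 255, 0],
    [0,   0,   0,   0, 0],
]
cropped = trim_dark_border(image)
print(cropped)
```

Dark pixels inside the label (such as the text itself) are kept, because only fully dark rows and columns at the edges are removed.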
 
<br>Misc notes:
 
Will recognize vertical text<br> Image input can be TIFF, JPEG, or GIF<br>
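
As an illustration of the input formats listed above, this sketch checks a file's extension and builds the basic command line <code>tesseract imagename outputbase</code>, which writes the recognized text to <code>outputbase.txt</code>. The helper name <code>tesseract_command</code> and the extension list are assumptions taken from this note; the formats a given Tesseract installation accepts may differ.

```python
from pathlib import Path

# Extensions for the formats named in the note above (an assumption:
# a particular Tesseract build may accept more or fewer formats).
SUPPORTED_EXTENSIONS = {".tif", ".tiff", ".jpg", ".jpeg", ".gif"}


def tesseract_command(image_path, output_base):
    """Return the argument list for a basic Tesseract run, after
    checking that the image has one of the listed extensions."""
    extension = Path(image_path).suffix.lower()
    if extension not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"unsupported image format: {extension or '(none)'}")
    return ["tesseract", str(image_path), str(output_base)]


# Example: build (but do not run) the command for one label image.
command = tesseract_command("label_0001.tif", "label_0001")
print(" ".join(command))
```

The list can be passed to <code>subprocess.run(command, check=True)</code> on a machine where Tesseract is installed.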