- what to look out for (with examples).
EH: We find that running the whole image increases the page count; each image ends up as 4-6 pages. This is not necessarily a problem, but it is good to be aware of if page count is an issue.<br><br>
== Tesseract Tips ==
Effective practices and hints for Tesseract.
<br>What works best:
Resolution: an x-height (the pixel height of a lowercase letter) of 20-40 pixels is ideal<br> Switching to grayscale, increasing contrast, and other image treatments can sometimes improve output
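<br>The resolution and image-treatment tips can be sketched in plain Python. This is only an illustration under stated assumptions: the function names, the 30-pixel default target (the midpoint of the 20-40 range), the BT.601 luma weights, and the linear contrast stretch are our choices, not part of Tesseract or of the workflow described on this page.

```python
def scale_for_x_height(measured_px, target_px=30.0):
    """Return the resize factor that brings a measured x-height to target_px.

    target_px=30 is an assumed midpoint of the 20-40 pixel range noted above.
    """
    if measured_px <= 0:
        raise ValueError("measured x-height must be positive")
    return target_px / measured_px


def to_grayscale(rgb_pixels):
    """Convert (r, g, b) tuples to 0-255 gray levels using BT.601 luma weights."""
    return [round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in rgb_pixels]


def stretch_contrast(gray_pixels):
    """Linearly stretch gray levels so the darkest maps to 0 and the brightest to 255."""
    lo, hi = min(gray_pixels), max(gray_pixels)
    if hi == lo:
        return list(gray_pixels)  # flat image: nothing to stretch
    return [round((p - lo) * 255 / (hi - lo)) for p in gray_pixels]


# A label whose lowercase letters are 10 pixels tall would need to be
# enlarged threefold to reach the assumed 30-pixel target.
factor = scale_for_x_height(10)
```

In practice an imaging library would do the resizing and pixel conversion; the functions above only show the arithmetic involved.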
<br>What to look out for:
Resolution: an x-height below 8-12 pixels will produce very poor OCR results<br> Using a black background for package labels (e.g. lichens, bryophytes) will create a black border that can significantly reduce OCR accuracy<br> Form labels can interfere with OCR output<br> Faded labels or images with poor lighting can be problematic<br> Old fonts can be problematic; however, it is possible to train Tesseract on new fonts
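<br>One possible way to deal with the black border is to crop it away before running OCR. The sketch below assumes the image is already available as rows of 0-255 gray levels; the function name and the brightness threshold of 40 are hypothetical and would need tuning for real label images.

```python
def crop_dark_border(rows, threshold=40):
    """Return the sub-image bounded by the outermost pixels brighter than threshold.

    rows is a list of equal-length lists of 0-255 gray levels.
    threshold=40 is an assumed cutoff for "black background".
    """
    bright_rows = [y for y, row in enumerate(rows)
                   if any(p > threshold for p in row)]
    if not bright_rows:
        return rows  # nothing brighter than the threshold; leave unchanged
    bright_cols = [x for x in range(len(rows[0]))
                   if any(row[x] > threshold for row in rows)]
    top, bottom = bright_rows[0], bright_rows[-1]
    left, right = bright_cols[0], bright_cols[-1]
    return [row[left:right + 1] for row in rows[top:bottom + 1]]


# A 4x4 image with a one-pixel black frame around a 2x2 label area.
image = [
    [0, 0, 0, 0],
    [0, 200, 210, 0],
    [0, 190, 30, 0],
    [0, 0, 0, 0],
]
cropped = crop_dark_border(image)
```

Dark pixels inside the label area (such as the 30 above) are kept, because only the outer bounding box is trimmed.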
<br>Misc notes:
Tesseract will recognize vertical text<br> Image input can be TIFF, JPEG, or GIF<br>
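<br>A minimal sketch of calling the Tesseract command-line tool from Python, assuming the <code>tesseract</code> executable is installed and on the PATH. The helper names are hypothetical, and the extension check simply mirrors the formats listed above (with <code>.tiff</code> and <code>.jpg</code> added as assumed aliases); it is not a statement of everything Tesseract accepts.

```python
import os
import subprocess

# Formats listed in the notes above; .tiff and .jpg are assumed aliases.
ACCEPTED_EXTENSIONS = {".tif", ".tiff", ".jpeg", ".jpg", ".gif"}


def build_tesseract_command(image_path, output_base, lang=None):
    """Build the argument list for: tesseract IMAGE OUTPUTBASE [-l LANG]."""
    ext = os.path.splitext(image_path)[1].lower()
    if ext not in ACCEPTED_EXTENSIONS:
        raise ValueError("unexpected image format: " + repr(ext))
    command = ["tesseract", image_path, output_base]
    if lang is not None:
        command += ["-l", lang]
    return command


def run_tesseract(image_path, output_base, lang=None):
    """Run Tesseract; the recognized text is written to OUTPUTBASE.txt."""
    subprocess.run(build_tesseract_command(image_path, output_base, lang),
                   check=True)
    return output_base + ".txt"
```

For example, <code>run_tesseract("label.tif", "label", lang="eng")</code> would write the recognized text to <code>label.txt</code>.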