Difference between revisions of "OCR Tips"


Revision as of 12:03, 2 October 2012

Recognition Server

ABBYY effective practices and hints for users of Recognition Server: what works best in our experience.

From Elspeth Haston:

EH: Start with fewer languages selected, since each additional language adds to the processing time (specimens could potentially be sorted geographically prior to OCR). We are currently processing our specimens from SW Asia and the Middle East, a large number of them from Turkey, so we run ABBYY with Turkish and English selected.
EH: We select high quality rather than speed.
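
The idea of sorting specimens geographically before OCR can be sketched as a simple batching step. This is only an illustration: the country-to-language table, the default language, and the function name are assumptions made for the example and are not ABBYY settings or part of Recognition Server.

```python
from collections import defaultdict

# Hypothetical lookup from a specimen's country to the OCR languages worth
# enabling for it. The entries are illustrative, not ABBYY configuration.
LANGUAGES_BY_COUNTRY = {
    "Turkey": ("English", "Turkish"),
    "United Kingdom": ("English",),
}
# Assumed fallback when a country is not in the table.
DEFAULT_LANGUAGES = ("English",)


def group_by_languages(specimens):
    """Group (image_name, country) pairs into batches keyed by language set."""
    batches = defaultdict(list)
    for image_name, country in specimens:
        languages = LANGUAGES_BY_COUNTRY.get(country, DEFAULT_LANGUAGES)
        batches[languages].append(image_name)
    return dict(batches)
```

Each resulting batch could then be submitted with only its own languages selected, so that no image is processed with more languages than it needs.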

PL: OCR quality can be enhanced when a large image is cropped, which also reduces page count.
PL: Images can be ingested from a shared folder, a scan station, FTP/FTPS, or the API. Hot-folder ingestion can be further controlled by including an optional XML ticket. XML tickets control workflow and output, and allow metadata to be ingested along with the image to be processed.
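
As a rough sketch of the cropping tip, the helper below converts a region given as fractions of the image into a pixel box. The function name and the choice of region are assumptions for illustration; where the label actually sits depends on the collection. The result uses the (left, upper, right, lower) order, which is the order Pillow's `Image.crop` expects if that library is used for the actual crop.

```python
def crop_box(width, height, left_frac, top_frac, right_frac, bottom_frac):
    """Convert fractional bounds (0.0 to 1.0) into a pixel box.

    Returns a (left, upper, right, lower) tuple in pixels.
    """
    fractions = (left_frac, top_frac, right_frac, bottom_frac)
    if not all(0.0 <= f <= 1.0 for f in fractions):
        raise ValueError("fractions must lie between 0.0 and 1.0")
    if left_frac >= right_frac or top_frac >= bottom_frac:
        raise ValueError("region must have positive width and height")
    return (
        round(width * left_frac),
        round(height * top_frac),
        round(width * right_frac),
        round(height * bottom_frac),
    )
```

For example, `crop_box(4000, 6000, 0.5, 0.75, 1.0, 1.0)` selects the lower-right quarter-height, half-width corner of a 4000 x 6000 pixel image.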

- what to look out for (with examples).

EH: We find that running the whole image increases the page count; each image ends up as 4-6 pages. This is not necessarily a problem, but it is good to be aware of if page count is an issue.
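
To see what this means for a batch, a back-of-envelope estimate can be sketched as follows. The 4-6 pages per image is the figure reported above for whole-image runs; the function name and the batch size in the example are assumptions.

```python
def estimate_page_range(image_count, pages_per_image=(4, 6)):
    """Return the (low, high) page count expected for a batch of images.

    The default of 4-6 pages per image is the range observed above when
    whole images are processed; other workflows may differ.
    """
    low, high = pages_per_image
    return image_count * low, image_count * high
```

Under these figures, a batch of 1,000 whole images would count as somewhere between 4,000 and 6,000 pages.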