Hackathon Challenge: Difference between revisions

155 bytes added, 11 January 2013

@@ Line 20: / Line 20: @@
 **The most basic form of input is OCR text in UTF-8 format from multiple engines.
 **There may optionally be OCR with exact spatial information about the location of characters on the original image.
 ***This will allow some algorithms to exploit spatial information to identify elements. This format is, however, not a main focus for this hackathon.
+*Some data dictionaries and authority files may be provided (or you may use those you already have access to) to help clean up OCR output before parsing.
 *Those wishing to pursue other goals, such as image segmentation, finding specific elements, or improving the usability and user interfaces of the OCR and parsing tools, are encouraged to do so and report back to the group at the hackathon.
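
As an illustration of the dictionary-based cleanup mentioned in the list above, here is a minimal sketch in Python. It is not a method prescribed by the hackathon: the function name `clean_tokens`, the sample authority terms, and the similarity cutoff of 0.7 are all assumptions chosen for the example. It replaces each whitespace-separated OCR token with the closest term from an authority list when the two are similar enough, using the standard-library `difflib` module.

```python
import difflib


def clean_tokens(ocr_text, authority_terms, cutoff=0.7):
    """Replace OCR tokens with the closest authority term, if one is similar enough.

    `authority_terms` stands in for a data dictionary or authority file
    (hypothetical sample data below). `cutoff` is an assumed similarity
    threshold between 0 and 1; tokens with no match above it are left as-is.
    """
    # Compare case-insensitively, but emit the authority file's own spelling.
    lookup = {term.lower(): term for term in authority_terms}
    candidates = list(lookup)
    cleaned = []
    for token in ocr_text.split():
        matches = difflib.get_close_matches(
            token.lower(), candidates, n=1, cutoff=cutoff
        )
        cleaned.append(lookup[matches[0]] if matches else token)
    return " ".join(cleaned)


# Hypothetical authority terms and a hypothetical noisy OCR line.
authority = ["Quercus", "alba", "Herbarium"]
print(clean_tokens("Quercu5 a1ba collected 1902", authority))
# prints "Quercus alba collected 1902"
```

This sketch splits on whitespace only, so punctuation attached to a token lowers its similarity score, and a low cutoff can replace correct words that happen to resemble an authority term. A real pipeline would likely tokenize more carefully and tune the threshold per field.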