Five iConference2013 Talks

:::;Jason Best: Abstract


=== [https://www.idigbio.org/sites/default/files/workshop-presentations/aocr-hackathon/iSchool.pptx Symbiota -- Creating an OCR and NLP enabled user interface and workflow to efficiently digitize 2.3 million lichen and bryophyte specimens] ===
:::;Edward Gilbert: This talk focuses on how the Lichen and Bryophyte Climate Change TCN (LBCC) project plans to use Optical Character Recognition (OCR) and Natural Language Processing (NLP) to help process over 2.3 million North American lichen and bryophyte specimens held in more than 60 U.S. herbaria. A major part of the LBCC image processing workflow consists of separate automated OCR and NLP processing steps intended to capture a significant portion of the metadata found on the specimen labels. After specimen images are converted to text fragments using the open-source Tesseract OCR engine, previously developed and custom-designed NLP algorithms will be used to parse the results into the appropriate Darwin Core fields. The resulting data will then be imported into a central database and made public through a unified user interface. The final steps will include transferring the results back to the management database systems used by each collection.
::::::The presentation will begin with a brief overview of the LBCC project along with a simple workflow diagram. The OCR/NLP components of the workflow will be emphasized in greater detail. The Symbiota user interface will be displayed with a short explanation of how it will be used to digitize the label metadata. Current limitations and challenges with OCR and NLP processing will be explained as they pertain to this project. Several solutions to these issues that are being explored by the project will be explained in detail along with their preliminary results.
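The parsing step described in the abstract (OCR text in, Darwin Core fields out) can be illustrated with a minimal sketch. It is not the LBCC project's NLP code, which the abstract does not describe; plain regular expressions are swapped in as a stand-in. The sample label text, the <code>PATTERNS</code> list, and the <code>parse_label</code> function are all hypothetical, and the input is assumed to be text already produced by an OCR engine such as Tesseract. Only the output keys (<code>scientificName</code>, <code>recordedBy</code>, <code>recordNumber</code>, <code>verbatimEventDate</code>, <code>verbatimElevation</code>) are actual Darwin Core terms.

```python
import re

# Hypothetical OCR output for a single specimen label (illustration only,
# not real project data). In the workflow above this text would come from
# the OCR step.
SAMPLE_OCR_TEXT = """LICHENS OF ARIZONA
Xanthoparmelia cumberlandia
Coconino Co.: near Flagstaff, on granite boulders
Elev. 2100 m
Coll. J. Smith 1234    12 June 1987
"""

# Illustrative patterns; each named group is a Darwin Core term.
# Real labels vary far more than these simple expressions allow for.
PATTERNS = [
    # A line consisting only of a capitalized genus and a lowercase epithet.
    re.compile(r"^(?P<scientificName>[A-Z][a-z]+ [a-z]{3,})[ \t]*$", re.MULTILINE),
    # Collector initials and surname, optionally followed by a collection number.
    re.compile(
        r"(?:Coll\.|Collector:?|Leg\.)\s+"
        r"(?P<recordedBy>[A-Z]\.(?:\s?[A-Z]\.)*\s+[A-Z][a-z]+)"
        r"(?:\s+(?P<recordNumber>\d+))?"
    ),
    # Day, month name, four-digit year, kept verbatim rather than normalized.
    re.compile(r"\b(?P<verbatimEventDate>\d{1,2}\s+[A-Z][a-z]+\.?\s+\d{4})\b"),
    # Elevation with a unit in metres or feet.
    re.compile(r"Elev(?:ation)?\.?:?\s*(?P<verbatimElevation>\d+(?:\.\d+)?\s*(?:m|ft)\b)"),
]


def parse_label(ocr_text):
    """Return a dict mapping Darwin Core terms to values found in the OCR text."""
    record = {}
    for pattern in PATTERNS:
        match = pattern.search(ocr_text)
        if match is None:
            continue  # Field not found; leave it out rather than guess.
        for term, value in match.groupdict().items():
            if value is not None:
                record[term] = value.strip()
    return record


if __name__ == "__main__":
    for term, value in parse_label(SAMPLE_OCR_TEXT).items():
        print(f"{term}: {value}")
```

Fields that no pattern matches are simply omitted, so a record parsed from a noisy label can be passed to a human reviewer in the user interface with the gaps visible instead of being filled with guesses.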