More news from the iDigBio Augmenting OCR Working Group (aOCR wg)
The aOCR wg's efforts to find ways to speed up digitization continue. Our working group put together a Wish List and then, at our first "in-person" meeting of the working group in October 2012, we decided which items in the Wish List to work on first, how to go about working on them, how to share what we are learning, and with whom to share our digitization desires.
Got questions about OCR output? Want to add your voice or expertise? Wondering who is a member of the current working group? Got ideas for our web services? Is there a workshop, or workshop topic, that you'd like to see at one of our digitization workshops? Let us know!
Would you like to come to our working group meetings? Do you have OCR protocols and workflows to share? Upload them to our aOCR wiki!
Left to Right (more or less): Robert Anglin (LBCC TCN and Symbiota), Robin Schroeder (ScioQualis.com), Dmitry Dmitriev (Illinois Natural History Survey), Kimberley Watson (New York Botanical Garden), Daryl Lafferty (SALIX2), Edward Gilbert (Symbiota), Stephen Gottschalk (New York Botanical Garden), Scott Bates (Macrofungi TCN), Dmitry Mozzerhin (seated, Marine Biological Laboratory), John Mignault (BHL & NYBG), Tianli Mo (Joseph F. Rock Herbarium), Kevin Love (iDigBio), Paul Schroeder (ScioQualis.com), Qianjin Zhang (LABELX), Alex Thompson (iDigBio), John Pickering (Discover Life), Debbie Paul (iDigBio). Not in photo: Bryan Heidorn (LABELX, University of Arizona), Phuc Xuan Nguyen (UCSD Computer Vision Labs & CalBug Project), Sean Murphy (BRIT), Jason Best (BRIT), Amanda Neill (BRIT; taking the photo!), Richard Eaton (BRIT), Ben W Brumfield (FromThePage.com), Steven Chong (remote participant), Michael Giddens (remote participant).
So Many Questions!
Every digitization project involved in imaging specimen labels, field notebooks, card files, logs, etc., wants to be able to get the data from those images automatically parsed into standard fields (think Darwin Core or other data standards) and into databases. Some data sets are more suitable for this digitization workflow than others, though. Those using images this way would like to know just exactly what is possible: What can current algorithms parse successfully (or not)? Which algorithms return the best (or worst) result? Which OCR software performs the best? What image processing or algorithm tweaking can be done to improve the OCR output and parsing? What web services need development in order to make these advancements available to the greater community? Who on the planet has the skills we need to make these improvements in parsing, image processing, web services, and user-interface tool integration? How do we find them and invite them to join us? How do we make sure we're not reinventing a tool some other group has already created?
Our Plans: Answering a Few of Those Questions.
Part one of our initial strategy includes Education and Outreach, both internal and external. For example, inside our community, we'd like to share what this group already knows about getting the best performance out of any particular OCR package, including:
For a herbarium-type label, good parsing algorithms have been shown to be 30% faster than typing (SALIX, SALIX2 in development). But OCR output is good for more than parsing! Studies show that those doing data transcription find the task more pleasurable and less tedious when working from datasets they can order by querying the OCR output (Haston, 2012). Querying OCR output automatically produces ordered data sets and therefore means less repetitive typing. Researchers also share that, after imaging and before digitization, OCR output from images provides a fast way to create research data sets for anyone who needs them - long before individually complete atomized records are done. So, for those projects taking images of labels or other documents containing specimen data, it makes sense to put OCR into the digitization process to get the most from the images and capitalize on the human interaction with the data from these images.
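As a toy illustration of that ordering idea, the sketch below sorts a batch of label images by how well their raw OCR text matches a query, so similar labels cluster together and a transcriber can work through them with less repetitive typing. All file names, field names, and sample text here are invented; this is not any project's actual tooling.

```python
# Hypothetical sketch: order label images for transcription by querying
# their raw OCR output. Records matching the query terms most strongly
# come first, so similar labels land in the same run of the batch.

def order_by_query(records, query):
    """Sort OCR records by how many query terms appear in the OCR text."""
    terms = [t.lower() for t in query.split()]

    def score(rec):
        text = rec["ocr_text"].lower()
        return sum(text.count(t) for t in terms)

    return sorted(records, key=score, reverse=True)

# Invented sample data standing in for real OCR output files.
records = [
    {"image": "sheet_001.jpg", "ocr_text": "Flora of Texas, Tarrant County"},
    {"image": "sheet_002.jpg", "ocr_text": "Lichens of Arizona, Pima County"},
    {"image": "sheet_003.jpg", "ocr_text": "Flora of Texas, Travis County"},
]

batch = order_by_query(records, "Flora of Texas")
# The two Texas sheets now lead the batch; the Arizona lichen sheet comes last.
```

In a real workflow the query would run against thousands of OCR text files, but the principle is the same: the grouping comes for free from text the OCR step already produced.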
Post-office parsing algorithms are being used with good results by projects like LABELX. These algorithms are very good…but not perfect for our purposes. The difference is in the data: post-office labels are standard and predictable, making it easy to write algorithms because the data is normalized. Legacy natural history museum data, on the other hand, is not standard. Patterns do exist, however, and machine-learning strategies make it possible to continue to improve the parsing of these data.
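To make the contrast concrete, here is a minimal rule-based sketch (not LABELX's actual algorithms) that pulls a few Darwin Core terms out of clean OCR text with hand-written patterns. It works precisely because the invented sample label is predictable; messy legacy labels break simple rules like these, which is why machine-learning approaches matter. The patterns and sample label are illustrative assumptions.

```python
import re

# Hypothetical rule-based parser: map fragments of a label's OCR text
# onto Darwin Core terms using hand-written regular expressions.
PATTERNS = {
    "recordedBy": re.compile(r"(?:Coll(?:ector)?\.?|Leg\.?)[:\s]+(.+)", re.I),
    "eventDate": re.compile(r"\b(\d{1,2}\s+\w+\s+\d{4})\b"),
    "country": re.compile(r"\b(U\.?S\.?A\.?|Mexico|Canada)\b", re.I),
}

def parse_label(ocr_text):
    """Return a dict of Darwin Core terms found in the OCR text."""
    record = {}
    for term, pattern in PATTERNS.items():
        match = pattern.search(ocr_text)
        if match:
            record[term] = match.group(1).strip()
    return record

label = "Plants of the U.S.A.\nColl: J. Smith\n12 June 1932"
print(parse_label(label))
# e.g. a record with recordedBy 'J. Smith' and eventDate '12 June 1932'
```

A rule set like this covers the labels its author anticipated and nothing else; a machine-learning parser trained on many labeled examples can keep improving as it sees more of the patterns that do exist in legacy data.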
From inside and outside our community, we want to continue to answer the questions and learn from each other's expertise in digitization, image analysis, image processing, human-in-the-loop data transcription, and parsing algorithm development.
Part two of our plan includes working on specific issues. For our first project, we wanted to see what we could document about what is - or is not - possible with parsing. In doing this, we also hoped to create a baseline the entire community could use as a starting point when making digitization workflow decisions. At our first aOCR wg meeting in October 2012, we decided that the best way to address this question would be to organize a hackathon. In addition, since the Information Science community has been working on digitization longer than the biological and paleontological communities, we thought it important to reach out to them to show them what we are doing, to get their input, and to engage them in our digitization challenges!
Left to Right: Bryan Heidorn, Stephen Gottschalk, Ben W Brumfield, Daryl Lafferty, Alex Thompson
Hackathon to establish a Baseline on Parsing, iConference Participation, and Outreach Efforts.
In February, we spent a week in Fort Worth, Texas, at iDigBio's 1st Hackathon. During that same time period, we were also making presentations at the iSchools iConference 2013 in order to engage the Information Science community in our efforts.
Tuesday, February 12th, iConference 2013: We held a half-day workshop at iConference 2013 entitled "Help iDigBio Reveal Hidden Data: iDigBio Needs You." Most of the participants at this workshop were very engaged graduate students, and we look forward to their continued participation in our digitization work. In the workshop, we organized our panel talks to tell a story, beginning with Who/What is iDigBio and the aOCR wg and Why are we at iConference 2013?, working our way through the scope of the challenge (Amanda Neill, BRIT), human-in-the-loop text-extraction/parsing strategies (Jason Best, BRIT; Edward Gilbert, Symbiota), parsing algorithm development (Bryan Heidorn, LABELX), and ending with a summary presentation entitled The Biodiversity Heritage Library (BHL) and linked data (John Mignault, BHL & NYBG), which encompasses our overall goal to not only make data accessible, but to integrate those data, linking them in a semantically meaningful way (aka "The Semantic Web"). You can read the abstracts (and soon, download the talks). Listen to the unedited panel recording.
Wednesday and Thursday, February 13 and 14, iDigBio Hackathon: Our 1st Hackathon began with much planning, and necessitated that we compile three data sets (over 25,000 images) from lichen, herbarium, and entomology collections, create three human-standardized test data sets, develop metrics to evaluate parsing results, and invite participants to five pre-hackathon organizational meetings - all of this in only 4 months! After all that, the Botanical Research Institute of Texas (BRIT) was our home for two days, February 13 and 14, and most of that time was spent parsing data sets and sharing results to make improvements. Some participants’ expertise and interests inspired them to lead the way in helping iDigBio to begin developing web services (automated approaches) to interact with, parse, and use the data in the OCR output files. We greatly appreciate the on-going efforts of those individuals, including Paul Schroeder, Robin Schroeder and Michael Giddens (remote participant), as they develop the web services, establish a list of what we need, and ultimately set up a test version. After development, these types of services will be available through the iDigBio API (application programming interface) for use by other IT staff on additional digitization projects. Also during the workshop, participants including Robin Schroeder, John Pickering, and others interested in workflows utilizing OCR took the time to draft an OCR workflows v1; plans are underway with the iDigBio Workflows working groups (DROID) to incorporate this into their modules.
We also want to acknowledge the contributions of Phuc Nguyen, a graduate student (nearly finished) from the CalBug project in the University of California Computer Vision lab. From Phuc, we learned a lot about what is currently possible with automated image segmentation. Imagine being able to automatically locate writing/typing in an image that also includes other objects like insects or fish. This research is very promising, and shows that while human-in-the-loop strategies are key to our digitization methods, we are going to have much more elegant ways to make the best use of a human's input. Similarly, imagine being able to automatically sort images into those with and without handwriting. One participant and blogger, Ben W. Brumfield (FromThePage.com), came up with a way to use OCR output to do just that. You can read more about Ben's strategy here: iDigBio Augmenting OCR Hackathon. This type of workflow would mean that images suitable for OCR would go in one pile while those with lots of handwriting would go directly to humans for parsing – but in an automated fashion. Records populated from automatic parsing would be sent to humans for quality assurance. The benefits: two different workflows, but 1) no human needed to make a decision at the beginning and 2) ordered data sets easily created for simpler and faster human input! Along these same lines, Jason Best at BRIT is working on an algorithm, "DarwinScore," to automatically score the auto-parsed output to further help differentiate images that can be successfully parsed by computers from those whose output needs more human intervention.
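The routing idea above can be sketched very simply. OCR engines do poorly on handwriting, so one rough signal is the fraction of OCR tokens that look like real words: a high ratio routes the image to automated parsing, a low one sends it straight to a human. This is a sketch in the spirit of that workflow, not Ben Brumfield's or BRIT's actual method; the word list and threshold are invented assumptions.

```python
# Hypothetical triage: decide whether an image's OCR output is clean enough
# for automated parsing, or garbled enough (likely handwriting) for humans.

KNOWN_WORDS = {"flora", "county", "herbarium", "collected", "creek", "june"}

def dictionary_ratio(ocr_text):
    """Fraction of OCR tokens found in a (tiny, illustrative) word list."""
    tokens = [t.strip(".,;:()").lower() for t in ocr_text.split()]
    tokens = [t for t in tokens if t]
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in KNOWN_WORDS)
    return hits / len(tokens)

def route(ocr_text, threshold=0.4):
    """Return 'auto-parse' for clean typewritten output, else 'human'."""
    return "auto-parse" if dictionary_ratio(ocr_text) >= threshold else "human"

print(route("Flora of Texas Collected June Herbarium"))  # prints "auto-parse"
print(route("~?e/ (r_-k |3o++om"))                       # prints "human"
```

A production version would use a full dictionary (or OCR engine confidence values) rather than a toy word set, but either way no human has to look at an image just to decide which pile it belongs in.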
Wednesday and Thursday, February 13 and 14, iConference: On Wednesday and Thursday evenings, the aOCR wg presented a poster at iConference 2013. The published poster abstract is available online here. Check it out to learn more about various members of our working group. On Thursday morning, aOCR wg member Bryan Heidorn also presented and discussed a short paper (abstract) in a Digital Libraries session at iConference. This short paper, entitled "Augmenting Optical Character Recognition (OCR) for Improved Digitization: Strategies to Access Scientific Data in Natural History Collections," introduces the Information Science community to iDigBio and explains the aOCR working group's goals and challenges. In addition, three papers discussed in the iConference Digital Libraries session focused on issues faced by digital libraries, including standards development, data sharing, and human data-finding behavior research.
Friday, February 15th, iConference: Some members of the aOCR wg hosted an Alternative Event (new at iConference 2013). There, we facilitated a round-table discussion of the Hackathon with information scientists attending the iConference. We met several new people, enjoyed a lively discussion, and as a result have been introduced to even more people who are looking forward to contributing to some of our working groups (like the new working group centered around Biodiversity Informatics Managers). Exactly the type of connections we were hoping for!
Left to Right: Using Adobe Connect for 5 pre-hackathon meetings: Debbie Paul, Kimberly Watson, Paul Schroeder, Robin Schroeder, Ben Brumfield
Social media: Facebook and Twitter continue to play a role in our events. While at the Hackathon and iConference, both Facebook and Twitter provided a place for those taking part to have a conversation with those present and those attending remotely at the same time. Through Twitter, we also found another community, The New York Public Library Labs (NYPL Labs), that is interested in what we are doing - we look forward to more discussions and knowledge-sharing with them! Take a look at us at work in the hackathon photos on Facebook!!!
Coming Up Next: We look forward to continuing our hackathons virtually! We owe many thanks to NESCent, and Hilmar Lapp in particular, for help with the basics of organizing a hackathon! We learned a lot and have begun the process of putting all this information together for dissemination: we are working on a white paper to share our hackathon experience so others thinking about this can benefit from our ups - and our oops! We'll keep you up to date as we go forward with this. In addition, we're discussing how to move forward with the development of web services to make useful OCR and OCR output analysis tools available to our community if they are using label images in their digitization workflows. Participants Edward Gilbert (Symbiota), Bryan Heidorn (LABELX), Daryl Lafferty (SALIX2) and others are hard at work making changes and integrating new algorithms from the workshop into their existing software. Stay tuned for more about these aOCR working group outcomes!!!
There's more, but this is supposed to be a blog - not a book! If you got this far, thank you from all of us in the working group!
And the thank yous! There are so many people to thank! This would not have been possible without the hard work of everyone involved and everyone in the community who helped when we had questions. We'd really like to hear from you! Got questions about OCR output? Want to add your voice, your expertise? Got ideas for our web services? Is there a workshop, or workshop topic about this, you'd like to see at one of our digitization workshops - or yours? Let us know! Would you like to come to our working group meetings? If you've got questions or input for us, we are here for you.
Thanks for reading!