Digitizing Biological Collections:
Global Issues and Decision Points
Imaging and databasing a biological collection seems like a straightforward task: procure the specimens, extract the data, take the image, and serve the image and data on the internet. However, in most cases there are numerous preliminary global issues and decision points to be resolved before actual imaging and data extraction can begin.
Perhaps most important is determining how specimen images will be used. Will they be primarily curatorial in nature, destined to help managers track and administer the collection? Or, will they be used mostly to allow the general public a glimpse of the treasures the collection holds? Will research scientists use them to discover and describe new species, elucidate evolutionary relationships, or better understand our biodiversity heritage? Or, will the primary audience be students, teachers, and amateur enthusiasts interested in identification, morphology, and classification? All of these purposes and many others, in combination or alone, are valid expectations. Clearly outlining the uses to which images will be put lays a firm foundation for addressing a suite of important decision points, including the need for pre-imaging curation, development of imaging station specifications, determining workflows, preparing for label data extraction, and deciding digital storage parameters.
Pre-imaging curation is essential for ensuring that a collection is properly prepared for digitization. Since different parts of any collection are at different stages of curation, it is helpful to determine which parts will be digitized first and the specific tasks that must be performed to make them ready. It may be that the most actively and thoroughly curated specimens provide the most efficient starting point, or perhaps it is more important to begin with taxonomic groups of special interest to local staff. Alternatively, curators may want to use impending digitization as the raison d’être for focusing on those parts of the collection that are most in need of curatorial attention.
Resolving nomenclatural issues and updating determinations may be among the most important tasks for curators. Indeed, ensuring that data and images served to the public are as accurately determined as possible is important. However, since one of the main goals of exposing collection data to the internet is to invite community input and involvement, the opportunity to solicit opinions from professional and amateur colleagues should be viewed as an essential benefit and a method for involving the broader community in collections management.
Other pre-imaging curation activities might include separating or removing labels for imaging or data capture, inserting bar codes or other unique identifier tags, organizing trays or folders, checking for damage, and cleaning or reattaching specimens. While nomenclatural and taxonomic issues should be dealt with by collection managers or professional biologists, these latter tasks can often be accomplished by adequately trained, meticulous assistants or volunteers.
Imaging station specifications derive directly from proposed image use and desired morphological or anatomical coverage. Drawer or tray images may benefit from robotic scanning configurations, like GigaPan, or stacking software similar to AutoMontage. Planar specimens of sufficient size can be imaged with a single exposure from a digital camera mounted to a copy stand. Small specimens less than about 2.5 cm long will benefit from macro or micro lenses, whereas specimens less than 5 mm long may require a microscope or long-distance microscopic lens attached to a digital camera. At a minimum, an imaging station will require a dedicated computer with image processing software; a digital camera, microscope, scanner, or other imaging device; appropriate lenses; a bar code scanner; lighting; a frame or harness to hold specimens and/or labels; an appropriate background; and potentially a suite of accessories.
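The size thresholds above can be captured in a small lookup, which may be useful when sorting a collection into imaging batches. The following is only a sketch: the function name is hypothetical, the 25 mm and 5 mm cutoffs are the approximate figures given in the text rather than firm limits, and real equipment choices also depend on specimen shape and the coverage desired.

```python
def suggest_optics(length_mm: float) -> str:
    """Suggest an imaging setup from specimen length (hypothetical helper).

    Cutoffs follow the approximate sizes mentioned in the text:
    under 5 mm, under about 25 mm (2.5 cm), and anything larger.
    """
    if length_mm <= 0:
        raise ValueError("specimen length must be positive")
    if length_mm < 5:
        return "microscope or long-distance microscopic lens"
    if length_mm < 25:
        return "macro or micro lens"
    return "digital camera on copy stand"


# Sort a few hypothetical specimens into imaging batches.
for length in (2.0, 12.0, 80.0):
    print(f"{length} mm: {suggest_optics(length)}")
```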
Cost, efficiency, and image quality are intricately related issues. Fashioning protocols and workflows that maximize image quality and data capture at the lowest reasonable cost per specimen is an important consideration for all collections, but especially for those with several hundred thousand to several million specimens. Close analysis of the individual steps in a workflow can pinpoint inefficient or repetitive, non-productive motions that can be streamlined for increased productivity. Saving even a second or two per step can result in large savings in time and money when spread across millions of specimens.
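A quick calculation shows how small per-step savings scale. The sketch below uses illustrative numbers only: one second saved on each of five steps across one million specimens. The step count and the function name are assumptions made for the example, not figures from any particular collection.

```python
def hours_saved(seconds_per_step: float, steps_per_specimen: int, specimens: int) -> float:
    """Total staff hours saved when each step is shortened by a fixed amount."""
    total_seconds = seconds_per_step * steps_per_specimen * specimens
    return total_seconds / 3600  # 3600 seconds per hour


# Illustrative assumption: 1 second saved on each of 5 steps, 1,000,000 specimens.
saved = hours_saved(1.0, 5, 1_000_000)
print(f"{saved:.0f} staff hours saved")  # about 1389 hours
```

Multiplying the result by a local hourly labor rate gives a rough estimate of the money saved.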
Designing effective workflows is dependent on the type of collection being digitized, the amount and type of data to be extracted, the level of detail to be databased, and the experience and number of staff available to do the work. A fully functional workflow might include specimen preparation, image capture, image processing and conversion, label data capture, georeferencing, databasing, and image storage. Whether these steps will all be part of a single linear workflow or segregated into two or more ancillary workflows is dependent upon the unique conditions of the collection being digitized.
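The choice between a single linear workflow and several ancillary ones can be made explicit by listing the steps once and assigning each to a named workflow. The sketch below is one possible way to model this; the function name, the workflow names ("main", "data"), and the particular split shown are hypothetical and would differ from collection to collection.

```python
from typing import Dict, List

# The steps of a fully functional workflow, in the order given in the text.
FULL_WORKFLOW: List[str] = [
    "specimen preparation",
    "image capture",
    "image processing and conversion",
    "label data capture",
    "georeferencing",
    "databasing",
    "image storage",
]


def split_workflow(steps: List[str], assignment: Dict[str, str],
                   default: str = "main") -> Dict[str, List[str]]:
    """Group steps into named workflows, preserving their original order.

    Steps missing from `assignment` fall into the `default` workflow.
    """
    workflows: Dict[str, List[str]] = {}
    for step in steps:
        workflows.setdefault(assignment.get(step, default), []).append(step)
    return workflows


# Hypothetical split: data-handling steps run as a separate ancillary workflow.
split = split_workflow(FULL_WORKFLOW, {
    "label data capture": "data",
    "georeferencing": "data",
    "databasing": "data",
})
for name, steps in split.items():
    print(name, "->", ", ".join(steps))
```

Passing an empty assignment leaves every step in one linear workflow.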
Among the final steps in a digitizing workflow is processing and storing the final images. Common practice suggests that image processing or alteration (such as cropping, color balancing, and conversion to JPG or TIFF for presentation on the internet) be performed only on copies of the original images, and that all original unprocessed images be archived in a universally accessible raw data format. Some digital camera manufacturers use proprietary raw image formats with their own unique filename extensions. For example, Nikon cameras create Nikon Electronic Format files with an NEF extension, whereas Canon’s raw files are stored with the extension CR2. Since neither of these file formats is universally accessible, a step must be inserted into the workflow to convert them to the digital negative (DNG) file format. Converting image files to a standard format is especially important when images from several institutions or collections will be stored and potentially processed in a centralized offsite storage facility.
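One simple way to build the conversion step into a workflow is to flag proprietary raw files by extension before archiving. The sketch below assumes only the two extensions named above (NEF and CR2); the function names are hypothetical, other manufacturers' formats would need to be added to the set, and the conversion itself is left to whatever DNG converter the institution chooses.

```python
from pathlib import PurePath
from typing import Iterable, List

# Proprietary raw extensions mentioned in the text; extend as needed.
PROPRIETARY_RAW_EXTENSIONS = {".nef", ".cr2"}


def needs_dng_conversion(filename: str) -> bool:
    """Return True if the file has a known proprietary raw extension."""
    return PurePath(filename).suffix.lower() in PROPRIETARY_RAW_EXTENSIONS


def conversion_queue(filenames: Iterable[str]) -> List[str]:
    """List the files that should be converted to DNG before archiving."""
    return [name for name in filenames if needs_dng_conversion(name)]


# Hypothetical file names from a mixed batch of images.
batch = ["specimen_0001.NEF", "specimen_0002.cr2", "specimen_0003.dng", "label_0001.jpg"]
print(conversion_queue(batch))
```

The extension check is case-insensitive, since cameras commonly write extensions in upper case.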