Data Standards for Sharing & Hands-On Experience with the Integrated Publishing Toolkit (IPT)

From Deb Paul, on Twitter @idbdeb

Got Data? We’re all ready to help you get your data discovered! Researchers, Collection Managers, Data Managers, Taxonomists, and Curators gathered in Ottawa, Canada, and in Gainesville, Florida, on January 13-14, 2015, to learn more about best practices for developing and sharing robust natural history collection specimen data. We benefitted from the dedicated efforts and combined experience of the planning team members from the Global Biodiversity Information Facility (GBIF), Canadensys, VertNet, Agriculture and Agri-Food Canada, Canadian Biodiversity Information Facility (CBIF, a GBIF node), USGS-BISON, and iDigBio. Special thanks go to Heather Cole, from CBIF in Ottawa, and Kevin Love, from iDigBio, who worked their magic to make it possible for all 74 of us to be in two places at once. We could not achieve this level of collaboration and expertise-sharing without their skills.

To produce this workshop, the planning team created materials and hands-on activities to introduce participants to the Ecological Metadata Language (EML), Darwin Core, Audubon Media Description (aka Audubon Core), the Global Genome Biodiversity Network (GGBN) extensions, and the Material Sample Core data standards. We delved into some spreadsheet skills for manipulating data to address data quality issues such as standard date formats, UTF-8 encoding, literal values, and taxon name standardization. In general, we also learned a lot about GBIF and data publishing.
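The workshop worked through these issues in spreadsheets, but the same two checks (date formats and UTF-8 encoding) can be sketched in a few lines of Python. This is only an illustration and was not part of the workshop materials. The function names and the list of input formats are assumptions, and English month names are assumed. Darwin Core recommends ISO 8601 dates, which is the target format here.

```python
from datetime import datetime

# Hypothetical list of date formats found in a legacy export; adjust to your data.
# Order matters: an ambiguous value such as 01/02/2015 is read as month/day/year here.
KNOWN_FORMATS = ["%Y-%m-%d", "%m/%d/%Y", "%d %B %Y", "%d %b %Y"]


def to_iso_date(raw):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    text = raw.strip()
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue  # try the next candidate format
    return None  # leave unparseable values for manual review


def is_valid_utf8(data):
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    for value in ["1/13/2015", "13 January 2015", "sometime in 2015"]:
        print(value, "->", to_iso_date(value))
    print(is_valid_utf8("Québec".encode("latin-1")))
```

Values that return None are deliberately left alone rather than guessed at, so that a person can review them.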

This course included a hands-on introduction to the GBIF Integrated Publishing Toolkit (IPT) to show how it facilitates:

data mapping,
the production of mapped data in a standard format known as a Darwin Core Archive (DwC-A),
the use of extensions,
robust metadata for enhancing data discovery,
and the registration and publication of a dataset.

Thanks to the IT staff in Ottawa and Gainesville, each participant logged in to their own copy of the IPT software so they could experience first-hand what the software does. Would you like to try out the IPT for yourself? It's easy! Visit the GBIF IPT Sandbox to sign up for a free account. You don't need your own data to try it, either; you can use the sample data you'll find there. Some extensions are still under review by the community. If you visit the IPT Sandbox, you can see all of them, including the GGBN extensions.

Some participants were interested in learning more about how to install an instance of the IPT software at their own institutions. For this, the workshop organizers worked hard to put together a comprehensive two-hour webinar prior to the workshop. All presentations and a video are available for you to use and share.

Some workshop highlights.

Workshop participants asked lots of great questions, and we asked them to share both what they learned and what they realized they still need to learn more about. Check out the Google Doc to read some of the questions asked and some of the answers.

Discussion sessions included questions like: Should data be “perfect” before publishing? No, because perfecting data first would make it take too long to create datasets of the size needed to support biodiversity research. Still, data quality matters, so it's important for each of us to know what we can do to improve it and do our part. Participants asked about linking genomics data. We explained how to provide GenBank accession numbers, and also briefly talked about the new Material Sample Core and the new GGBN extensions, which make it possible to share genomics data and genomics metadata related to your biological and paleontological specimens. Another insightful question: "Where to share my data?" Answer: everywhere, and use globally unique identifiers when sharing your data. Robust use and sharing of globally unique identifiers means you can share your data in more than one place to enhance discovery and data re-use.

Some participants discovered hidden characters in their datasets. Participants saw this first-hand when some of them opened their own datasets in Excel and saw the expected number of rows, but when opening in the GBIF IPT, saw a message showing many more than the expected number of rows.

Seen in Excel, each data row seems to be fine, beginning with the Specimen Number as expected (column A). Using Notepad++ (or other text-editing software), you see the reason for the extra rows: there are extra “hard returns” inside the data. Getting rid of the extra hard returns fixes this. It’s just one example of an opportunity to find out whether you can change what’s happening in your database or modify your data export methods to prevent these issues.
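For larger files, a short script can take the place of scanning by eye in a text editor. The sketch below is an illustration, not the method used at the workshop. It assumes the export is a comma-separated file in which the stray hard returns sit inside quoted fields, which is why a spreadsheet shows the expected number of rows. The function names and sample column names are made up for the example.

```python
import csv
import io


def find_multiline_records(csv_text):
    """Return (record_number, first_field) for each record with an embedded hard return.

    Record numbers count the header row as record 1.
    """
    reader = csv.reader(io.StringIO(csv_text, newline=""))
    flagged = []
    for record_number, row in enumerate(reader, start=1):
        if any("\n" in field or "\r" in field for field in row):
            flagged.append((record_number, row[0]))
    return flagged


def strip_hard_returns(csv_text):
    """Rewrite the CSV with every run of whitespace in a field collapsed to one space."""
    reader = csv.reader(io.StringIO(csv_text, newline=""))
    out = io.StringIO(newline="")
    writer = csv.writer(out, lineterminator="\n")
    for row in reader:
        # str.split() with no arguments splits on any whitespace, including \r and \n
        writer.writerow([" ".join(field.split()) for field in row])
    return out.getvalue()


if __name__ == "__main__":
    sample = 'catalogNumber,locality\r\n1001,"Ottawa,\nOntario"\r\n1002,Gainesville\r\n'
    print(len(sample.splitlines()), "physical lines in the file")
    print(find_multiline_records(sample))
    print(strip_hard_returns(sample))
```

Note that the cleanup also collapses repeated spaces and tabs inside a field. That is usually harmless for specimen data, but check it against your own records before overwriting an export.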

Some participant thoughts.

Going through the process [mapping] with both an example data set and my own data was an excellent exercise. It was also useful to see what issues other participants experienced with their data.
That there's a good community around the tools. There are people we can get support from.
I would love to have another workshop on data cleaning and tools such as Google Refine.

Thanks to detailed and plentiful feedback, we have lots of new ideas from you for more workshops, materials, and webinars. Stay tuned! We're excited to say that some of you told us that you'll now have new data to share that's never been out in the world before. This is great news, and we all look forward to helping this happen in any way we can. And, thanks again to all who made this workshop possible.

Wondering what's coming up next? Keep up with iDigBio. Join an iDigBio listserv, join a working group, or subscribe to our calendar and our newsletter, the iDigBio Spotlight. Still have questions? You can find expertise to answer your questions on our listservs and also from members of our working groups. All are welcome.

Note: This is the second in a series of biodiversity informatics workshops focusing on workforce training. The first of these was Data Carpentry, the third is Field to Database (March 9 – 12, 2015, Gainesville), and the fourth is Managing Natural History Collections Data for Global Discoverability (September 15 – 16, 2015, Arizona State University).
