Data Carpentry - Please can we have some more?!

iDigBio and the American Museum of Natural History (AMNH) co-hosted a Data Carpentry Workshop on Monday and Tuesday, September 29 – 30, 2014.

What skills do researchers in the life sciences need to be equipped with today to address current issues facing our planet? How can they make best use of all the data available to them, now, and in the future?

To start off our Data Carpentry Workshop, University of Florida (UF) Botany Professor and iDigBio PI, Pam Soltis, shared her vision and historical perspective on the skills researchers need to make best use of data, now and going forward. Drawing on her own thorough grounding in statistical methods, Pam used her talk, Linking Heterogeneous Data in Biodiversity Studies: the need for data carpentry, to highlight how changes in science, and in data, require researchers to acquire new skills.

For two intensive, information-filled days of hands-on learning designed for beginners, 31 students tackled improving their spreadsheet skills, learned about the power of Open Refine to clean data and reveal data patterns via facets and clustering algorithms, discovered the power of the shell, found out just how simple it can be to get a dataset from a spreadsheet into a database and query it with structured query language (SQL), and got an introduction to R for data analysis and visualization.
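For readers who were not in the room, here is a minimal sketch of that spreadsheet-to-database step. It is not the workshop's own material: it assumes Python with its built-in csv and sqlite3 modules, and the table name, column names, and sample rows are all made up for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical data standing in for a spreadsheet exported as CSV.
CSV_TEXT = """species,site,individuals
Quercus alba,A,12
Quercus alba,B,7
Acer rubrum,A,5
"""


def load_csv_into_sqlite(csv_text, conn, table="observations"):
    """Create a table and insert every CSV row; return the number of rows loaded."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    conn.execute(
        f"CREATE TABLE {table} (species TEXT, site TEXT, individuals INTEGER)"
    )
    # Named placeholders are filled from each row's dictionary keys.
    conn.executemany(
        f"INSERT INTO {table} (species, site, individuals) "
        "VALUES (:species, :site, :individuals)",
        rows,
    )
    conn.commit()
    return len(rows)


def totals_by_species(conn, table="observations"):
    """Sum the individuals recorded for each species, sorted by species name."""
    cursor = conn.execute(
        f"SELECT species, SUM(individuals) FROM {table} "
        "GROUP BY species ORDER BY species"
    )
    return cursor.fetchall()


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")  # throwaway in-memory database
    load_csv_into_sqlite(CSV_TEXT, connection)
    for species, total in totals_by_species(connection):
        print(species, total)
```

Once the rows are in a table, a question like "how many individuals per species?" becomes a single query instead of a round of copying and sorting in the spreadsheet.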

Broadening Participation.

Graduate students made up 60% of the participants; the other 40% were university faculty and staff. Nine students participated via Adobe Connect from the AMNH, including students from the City College of New York (CUNY), AMNH - Columbia University, and Hunter College. Three Information Science students from Florida State University (FSU) joined the UF students, faculty, and staff to make 31 participants total. Across diverse fields, there is a demand for beginner-level courses introducing researchers to up-to-date computational literacy, data literacy, and data management skills. Disciplines of participants ranged across Physics, Earth Sciences, Ecology, Zoology, Epidemiology, Botany, Genetics, Engineering, Social Science, Humanities, Tech Support, Public Health, and Information Science.

The Workshop Experience.

All available workshop slots at UF and AMNH filled in just 3 days, with four people left on the wait-list at UF. With a student-teacher ratio of 3:1, everyone found someone nearby, ready and willing to assist, if they ran into tricky bits.

The iDigBio Data Carpentry Workshop Wiki lists all materials used and topics covered, and includes recordings, notes taken, links to the datasets and materials on GitHub, the participant list, and more. Using Adobe Connect (AC) software and Kevin Love’s know-how, UF and AMNH students met each other virtually to learn together and share problem-solving strategies. We took notes together using a MoPad, with help from our remote assistant from USGS-BISON, Derek Masaki. Thanks, Derek! Scenes from the workshop are up on the iDigBio Facebook pages.

Tracy K Teal, Professor at Michigan State University (MSU) in Microbiology and Molecular Genetics, walked us through better spreadsheet skills and the power of the shell. Deb Paul (that’s me) highlighted the importance of quality data and showed how one tool, Open Refine, can be part of your scientific workflow to enhance your data and its fitness-for-use. Matt Collins (iDigBio Systems Administrator) provided a hands-on, step-by-step introduction to the world of relational databases and SQL. All of these skills led up to an interactive introduction to the scripting language, R, taught by Francois Michonneau, PhD candidate (Marine Invertebrates) at UF. Katja Seltmann, Entomologist and Project Manager for the Tri-Trophic Thematic Collection Network (TTD-TCN), provided instruction at the remote location, AMNH. In addition to our 5 instructors, we also had assistants to make sure no one got too lost, or waited too long for help. The workshop depends on assistants to run smoothly. Becoming a Data Carpentry instructor involves attending a Data Carpentry workshop and assisting at one. Several of our assistants are in the process of becoming Data Carpentry certified.

AMNH students report they can’t wait to do this again. All at UF and AMNH are clamoring for more R, eager to pick up where we left off on day two, just as Francois got to the good stuff (in R) with his amazing demonstration of the power of all these skills combined. We’re thinking that Data Carpentry courses, normally two days, need a third day.

A bit on Assessment (more on this in a future post).

For assessment, Data Carpentry courses use not only pre- and post-workshop surveys, but also minute cards. Periodically, after a course module, students are asked to write down one thing they learned and one thing they still find confusing. This immediate feedback provides opportunities for mid-course correction, as well as valuable input for future courses. Some examples of minute card comments from our Data Carpentry workshop…

| Something I learned | Something I still find confusing |
| --- | --- |
| Be careful with naming files, don’t use spaces | I have my own versioning schema. Are there standards for versioning? |
| Export spreadsheet data as CSV, or perhaps TSV | I’m still a bit confused about when to use () and [] in the same line |
| Basic R syntax | What are the benefits of using R as opposed to SPSS? <excluding cost> |
| Never understood cbind() before [now I do] | Still confused on some terminology – objects vs. variables? Vectors vs. factors? |

Our post-workshop survey resulted in an overall workshop grade of A- and many comments indicating the desire for more such focused, hands-on training, targeted at beginners and designed with the biodiversity researcher in mind.

What are some lessons learned at this workshop? Our remote participant strategy seems to have worked well to extend the reach of our workshop beyond UF. Keys to making a remote workshop site (AMNH) successful include having:

  1. an on-site instructor in the remote location who is familiar with all the course materials and the skills being taught
    1. in the event the connection is lost, the remote instructor can carry on with the lessons
  2. an instructor, or other individual, in the remote location who can troubleshoot the audio / video issues that arise.

What’s Next?

  • Would you like to request a Data Carpentry Workshop? Please send an email to
  • Are you interested in becoming a Data Carpentry Instructor? We use the Software Carpentry training course to certify our instructors. Our goal is to cease to be needed because all scientists have the skills they need to manipulate their data. Until then, if you’ve got skills and want to enhance them, along with the skill set of your colleagues in the biological and paleontological sciences community, please join us.
  • Discussions are just beginning for another Data Carpentry Workshop to be held at FSU in the Spring of 2015 with a remote location to be decided.
  • Note that the broader community, across the planet, is converging on ways to define the skills that are needed and the best way to meet the demand for these skills. This includes conversations about how to get these skills into undergraduate and K-12 education so that incoming graduate students have them at the start of their advanced degree programs. For examples of this international convergence, see the descriptions of two upcoming Biodiversity Information Standards (TDWG) 2014 sessions: the Interest Group / Task Group Meeting: Biodiversity Informatics Curriculum / Teaching, and the Workshop: Effective Biodiversity Data Management Training.

Please let us know your thoughts. What skills do you need? What else do we need to cover? Got an idea for where to host one of these?

Thanks for reading and stay tuned for more Data Carpentry!

If you've made it this far, you might be wondering...

Just where did Data Carpentry come from?

At the COLLAB-IT meeting in September of 2013, one break-out group turned an idea into action and formed Data Carpentry. The IT groups from NESCent, BEACON, iDigBio, NEON, iPlant, SESYNC, DataONE, and NIMBios shared their observations about the data literacy and computational literacy skills needed across the stakeholders in these overlapping communities. The course content needed to address these skills gaps makes up the Data Carpentry curriculum.

Following the Software Carpentry model, Data Carpentry seeks to improve and enhance the skills researchers need to collect, manage, and analyze data efficiently. We aim to teach skills that lead to reproducible, sustainable scientific workflows, which in turn yield discoverable, reusable datasets and reproducible analyses.