Tales from A Data Carpentry Workshop: In Demand

by Deb Paul @idbdeb, Kevin Love, and Matt Collins

While 27 students were lucky enough to get into this first-ever two-day Data Carpentry course, more than 62 people were on the wait list! (And this doesn't include those who decided not to add themselves to that wait list.) Why were they so eager to enroll?

Tracy Teal teaching R at Data Carpentry Workshop #1 at NESCent May 2014

Science curricula (undergraduate and graduate) across the world are in flux as institutions update courses and design new content to give students up-to-date skills for analyzing and managing ever-larger datasets in effective, reproducible, collaborative scientific research. This Data Carpentry workshop, hosted at the National Evolutionary Synthesis Center (NESCent), immersed students in how to get data out of Excel (and into a database) and how to use powerful tools like R, the so-called shell, and SQL to create repeatable workflows that result in publication-ready materials and analysis.

Participants' backgrounds included evolutionary biology and ecology, microbial ecology, fungal phylogenomics, marine biology, environmental engineering, and library science. The idea for this first in a series of Data Carpentry workshops, designed for beginners, came from a meeting of the COLLAB-IT working group. Members of this working group come from the various NSF biocenters (including NESCent, BEACON, NIMBioS, iPlant, NEON, SESYNC, and iDigBio). At last year's working group meeting, we found that our stakeholders have common needs when it comes to computational and data literacy.

When asked why they came to this workshop, the students were not shy in sharing their data challenges, saying things like (thanks Karen Cranston for the list!):

  • I usually manage data in Excel and it's terrible and I want to do it better.
  • My advisor insists that we store 50,000 barcodes in a spreadsheet, and something must be done about that.
  • I want to teach a reproducible research class.
  • I want to use public data.
  • I work with faculty at undergrad institutions and want to teach data practices, but I need to learn it myself first.
  • I'm organizing GIS data and it's becoming a nightmare.
  • I'm trying to reboot my lab's workflow to manage data and analysis in a more sustainable way.
  • I'm re-entering data over and over again by hand and know there's a better way.

Each biocenter plans to offer its own version of this course, tailored to the needs of its stakeholders and improved each time based on participant feedback. The Data Carpentry team is following the Software Carpentry (SWC) workshop / training model and working with SWC and Mozilla Science Lab to continue improving the workshop. At iDigBio, we're planning to host our own version in fall or winter (so stay tuned - sign up for our iDigBio e-newsletter or iDigBio Listserv)! Three of us from iDigBio participated as assistants and observers to learn what it takes to put together a Data Carpentry workshop.

Matt Collins, iDigBio System Administrator and Workshop Assistant, comments on his workshop experience and ideas for future versions: The popularity of this workshop and the enthusiasm of the participants were amazing. Developing technical skills for conducting life science research is a seriously underserved area. Engineering schools have long required basic programming and data analysis classes, but most curricula in other disciplines do not offer that opportunity. It is great to see so many people seeking out these skills to improve their data quality and scale up their research.

Instructors were Karen Cranston (NESCent) teaching us about the shell, followed by Tracy Teal (MSU/BEACON) revealing the power of R, Ethan White (Utah State University) demystifying SQL, and Hilmar Lapp (NESCent) putting it all together into a workflow. Alongside the instructors, we were fortunate to have five assistants: Darren Boss (iPlant), Matt Collins (iDigBio), Deb Paul (iDigBio), Mike Smorul (SESYNC), and Kevin Love (iDigBio). Assistants, with some training from Software Carpentry, are potential instructors for future workshops. NESCent hosted this event and provided yummy snacks, and the Data Observation Network for Earth (DataONE) sponsored this workshop. Thanks!

Are you intrigued? Want to become an instructor? Contact any of us mentioned in these postings. Want to find out when the next course will be offered? Sign up for our iDigBio e-newsletter.

Some Lessons Learned / Reinforced. (See more at Inaugural Data Carpentry workshop)

  • Students, and current researchers, are eager to acquire these skills. Demand is high and our small group is working on ways in which we might use a train-the-trainers model to reach more would-be workshop participants.
  • It seems we might teach SQL before R next time, and this may help students understand R syntax better.
  • We used the same dataset (a real one) for the entire course. It came from one of the instructors, and that direct knowledge of the dataset's particulars was very helpful!
  • Students and helpers need a guide as to where the shell-scripting lesson is going. Otherwise, when a helper looks down to assist a student who is stuck, both are lost by the time they look back up.
  • The dataset was very tidy indeed. We did not have time in this course to address a more real-world scenario, in which disparate datasets often need a great deal of clean-up and standardization before they can be merged and used successfully to address research questions.

Looking for more Tales about Data Carpentry? You can read more at Data Carpentry Related Links.

Follow us on Twitter #datacarpentry @iDigBio @NESCent @tracykteal @idbdeb @ethanwhite @kcranstn @hlapp @swcarpentry @cbahlai