3-day collaborative task: create a Data Carpentry genomics lesson with assessment modules

By: Deb Paul, Shari Ellis, Andréa Matsunaga, Blaine Marchant

From Deb

The four of us just got back from a 3-day hackathon, where we were part of a team collaboratively constructing a Data Carpentry genomics lesson. Invited applicants, including genomics researchers, educators, assessment experts, computer scientists, and graduate students, gathered at Cold Spring Harbor Laboratory (CSHL). Tracy Teal (BEACON), Jason Williams (iPlant, CSHL host), Mike Smorul (SESYNC), Mary Shelley (SESYNC), Shari Ellis (iDigBio), and Hilmar Lapp (NESCent) organized this curriculum development workshop to collectively design a genomics lesson suitable for beginners.

What data and computational skills do our current and future researchers need for their -omics research? Where are they getting these skills? How can we introduce beginners to these skills and instill a life-long sense of empowerment by teaching enhanced skills that foster scientific productivity? How do we teach them, and how do we measure whether we’ve been effective?

We used a GitHub wiki to organize our work and track our progress as we created these materials. You can see the materials for yourself on the Data Carpentry Genomics and Assessment Hackathon Wiki. The four of us from iDigBio, Blaine Marchant (Graduate Student, Soltis Lab), Andréa Matsunaga (Assistant Research Professor, University of Florida), Shari Ellis (Project Evaluator), and Deborah Paul (me, or rather, I) (Biodiversity Informatics Specialist), brought different skills and perspectives to materials development. Shari provided assessment guidance; Blaine brought the plant-genomics graduate student point of view (as someone who is learning these skills in situ); Andréa added her expertise in the skills needed to manage big data and the skills researchers need when working with cloud and high-performance resources; and I contributed to the assessment and to the data organization and management section.

Twenty-seven of us divided up into breakout groups “A”, “C”, “T”, “G”, and “U”, each focusing on a specific part of an introductory genomics lesson that can potentially be used with high school and undergraduate students. Later, using some creative grouping algorithms, we sorted ourselves to work on different sections of the genomics lesson:

  • Data organization and management
  • Working with genomics file types
  • Introduction to the command line
  • Using cloud computing for genomics
  • Data wrangling and processing
  • R for data analysis and visualization


From Shari

From the outset, the hackathon organizers wanted assessment to develop hand-in-hand with the lessons rather than as an afterthought. I facilitated conversations about what could realistically be assessed given the constraints of the workshop setting. We decided it would be fruitful to develop assessments to track change from pre- to post-workshop in three areas—declarative knowledge, skill, and attitudes/dispositions. I rotated among groups and reviewed progress on the hackathon wiki to help translate lesson objectives into one of the three categories and begin crafting survey questions to measure each.

From Blaine

As a second-year graduate student just beginning to delve into the field of genomics, I added insight into the mindset of the target audience, while also contributing to the basic data organization, command line, and data wrangling sections of the lesson plan. I was staggered by the variety of scientific studies incorporating genomic data and analyses, as well as by the breadth of questions currently being addressed with these novel tools. As for applying these lesson plans, the issue today is that genomics is such a young field that no one has truly created a curriculum to teach the tools from the beginning; everyone is just learning them on their own. I think this hackathon will produce an incredible resource that can help remedy this issue and get a broader audience of biologists using genomic tools in no time.

From Andréa

As a computer engineer with experience improving the performance of genomics tools, I contributed to the lesson on using cloud computing for genomics. The goal is to enable scientists to make efficient use of resources beyond their personal computers or laptops by running applications in parallel. The idea is to make available a virtual machine with all the tools needed for the Data Carpentry lessons, so that all activities can be carried out on cloud infrastructures such as those provided by Amazon EC2, Microsoft Azure, and iPlant Atmosphere.

Some insights and observations to keep in mind for future efforts

[Image: screenshot from Tracy Teal's presentation at the Data Carpentry hackathon]

It does take an investment of some time to acquire these data and computational literacy skills (thanks for the screenshot, Tracy!). Researchers are enticed to invest the time and effort needed to learn these new skills when they discover how much they can increase their research productivity (from conversations with Emilio Bruna (UF), Greg Wilson (Software Carpentry), and others).

Bioinformatics Resources Australia - European Molecular Biology Laboratory (BRAEMBL) asked scientists about their greatest bioinformatics challenges. Scientists responded that they most needed help with expertise, and that training would be the most useful thing BRAEMBL could offer them. In short, scientists are asking for bioinformatics training; see the BRAEMBL survey report.

It's great to be part of the Data Carpentry (DC) family, and to know that DC is a collaborative, worldwide, community-driven effort designed specifically to address the area scientists say they need the most help with! Here's to more DC!

What does it take to put together a Data Carpentry module? Who can participate? How does the model work? If you’re new to Data Carpentry, you may want to go to www.datacarpentry.org to find out more.