Managing Natural History Collections Data for Global Discoverability: Difference between revisions

 
(71 intermediate revisions by 6 users not shown)
Line 31: Line 31:
The theme of the "Collections Data for Global Discoverability" workshop is ideally suited for natural history collections specialists aiming to increase the "research readiness" of their biodiversity data at a global scale. Have you found yourself in situations where you need to manage larger quantities of collection records, or encounter challenges in carrying out updates or quality checks? Do you mainly use spreadsheets (such as Excel) to clean and manage specimen-level datasets before uploading them into your collections database? The workshop is most appropriate for those who are relatively new to collections data management and are motivated to provide the global research community with accessible, standards- and best practices-compliant biodiversity data.
The theme of the "Collections Data for Global Discoverability" workshop is ideally suited for natural history collections specialists aiming to increase the "research readiness" of their biodiversity data at a global scale. Have you found yourself in situations where you need to manage larger quantities of collection records, or encounter challenges in carrying out updates or quality checks? Do you mainly use spreadsheets (such as Excel) to clean and manage specimen-level datasets before uploading them into your collections database? The workshop is most appropriate for those who are relatively new to collections data management and are motivated to provide the global research community with accessible, standards- and best practices-compliant biodiversity data.


During the workshop, essential information science and biodiversity data concepts will be introduced (e.g., data tables, data sharing, quality/cleaning, Darwin Core, APIs). Hands on data cleaning exercises using spreadsheet programs and readily usable and free software will be performed. The workshop is platform independent, and thus will not focus on the specifics of one or the other locally preferred biodiversity database platforms, instead addressing fundamental themes and solutions that will apply to a variety of database applications.
During the workshop, essential information science and biodiversity data concepts will be introduced (e.g., data tables, data sharing, quality/cleaning, Darwin Core, APIs). Hands-on data cleaning exercises using spreadsheet programs and readily usable and free software will be performed. The workshop is platform independent, and thus will not focus on the specifics of one or the other locally preferred biodiversity database platforms, instead addressing fundamental themes and solutions that will apply to a variety of database applications.


<!-- We'll discuss and focus on the concepts, skills, and tools we need to share biodiversity occurrence data and related data such as genomics, and media. Datasets will be taken from organismal and evolutionary biology, biodiversity science, ecology, and environmental science. The workshop format includes lectures and hands-on work, so participants are required to bring their own laptops. We will provide information and instructions on any necessary software installations.-->
<!-- We'll discuss and focus on the concepts, skills, and tools we need to share biodiversity occurrence data and related data such as genomics, and media. Datasets will be taken from organismal and evolutionary biology, biodiversity science, ecology, and environmental science. The workshop format includes lectures and hands-on work, so participants are required to bring their own laptops. We will provide information and instructions on any necessary software installations.-->
Line 45: Line 45:
===About===
===About===


'''Instructors (iDigBio):''' Katja Seltmann, Amber Budden, Edward Gilbert, Nico Franz, Greg Riccardi, Deborah Paul, Joanna McCaffrey, Kevin Love, Anne Thessen
'''Instructors (iDigBio):''' Katja Seltmann, Amber Budden, Edward Gilbert, Nico Franz, Greg Riccardi, Deborah Paul, Joanna McCaffrey, Kevin Love, Anne Thessen, David Bloom


'''Skill Level:''' We are focusing our efforts in this workshop on beginners.
'''Skill Level:''' We are focusing our efforts in this workshop on beginners.


'''Where and When:''' Tempe, AZ at the Arizona State University (ASU) School of Life Sciences Natural History Collections, Informatics & Outreach Group in their new [https://www.flickr.com/photos/taxonbytes/15501490652/ Alameda space], September 15 - 16, 2015
'''Where and When:''' Tempe, AZ at the Arizona State University (ASU) School of Life Sciences Natural History Collections, Informatics & Outreach Group in their new [https://www.flickr.com/photos/taxonbytes/15501490652/ Alameda space], September 15 - 17, 2015


'''Requirements:''' Participants must bring a laptop.
'''Requirements:''' Participants must bring a laptop.
Line 68: Line 68:
==Agenda==
==Agenda==
*Managing NHC Data Adobe Connect Room http://idigbio.adobeconnect.com/nhcdata
*Managing NHC Data Adobe Connect Room http://idigbio.adobeconnect.com/nhcdata
*Monday evening, September 14th: pre-workshop informal get-together at [to be decided], from [time to be decided].
*Monday evening, September 14th: pre-workshop informal get-together at Vine Tavern and Eatery, 6 PM.


Schedule - subject to change.
Schedule - subject to change.
Line 80: Line 80:
|-
|-
|8:30-9:15
|8:30-9:15
|Welcome, Logistics, Intro to the Workshop, Why Share Data? Why this workshop?
|[http://www.idigbio.org/sites/default/files/workshop-presentations/managing-nhc-data/01_WelcomeWhyThisWorkshop.pptx Welcome, Logistics, Intro to the Workshop, Why Share Data? Why this workshop?]
[http://www.idigbio.org/sites/default/files/workshop-presentations/managing-nhc-data/TUE_0830_Session1_WhyThisWorkshop_Pt2_Budden-clean.pptx Why this Workshop?, part 2]
:quick exercise - what are your data challenges? what software do you use?
:quick exercise - what are your data challenges? what software do you use?
:key point - why share data?
:key point - why share data?
Line 86: Line 87:
|-
|-
|09:15-9:35
|09:15-9:35
|General Concepts and Best Practices
|[http://www.idigbio.org/sites/default/files/workshop-presentations/managing-nhc-data/TUE_0915_Session2_ConceptsBestPractices_Budden-clean.pptx General Concepts and Best Practices]
:brief introduction to data modeling, the data life-cycle, and relational databases
:the data life-cycle, brief introduction to data modeling, and relational databases
|Ed Gilbert and Amber Budden
|Ed Gilbert and Amber Budden
|-
|-
|9:35-9:55
|9:35-9:55
|Overview of Data standards
|[http://www.idigbio.org/sites/default/files/workshop-presentations/managing-nhc-data/03_OverviewOfDataStandards.pptx Overview of Data standards]
:Darwin Core, EML, Audubon Core, GGBN, DwC-A, Identifiers (GUIDs vs local)
:Darwin Core, EML, Audubon Core, GGBN, DwC-A, Identifiers (GUIDs vs local)
|Ed Gilbert, Deb Paul
|Ed Gilbert, Deb Paul
Line 99: Line 100:
:hands-on exercise with occurrence specimen data set
:hands-on exercise with occurrence specimen data set
:data set with known mapping / standardization issues.
:data set with known mapping / standardization issues.
:[http://rs.tdwg.org/dwc/terms/index.htm Darwin Core Terms]
:[https://drive.google.com/file/d/0B0Rlroh4mbthTUFEYTVUU2hZNjQ/view?usp=sharing Sample Data]
:[http://www.idigbio.org/sites/default/files/workshop-presentations/managing-nhc-data/SampleDataSetIssues_WorkshopVersion.docx Known Issues in Sample Data]
|All
|All
|-
|-
| style="background-color: #eee;" | 10:30-10:50
| style="background-color: #eee;" | 10:30-10:50
| style="background-color: #eee;" | Break
| style="background-color: #eee;" | Break
| style="background-color: #eee;" | all
| style="background-color: #eee;" |  
|-
|-
|10:50-11:30
|10:50-11:30
|Data Management Planning
|Data Management Planning
:choosing a collection management system, data flow, data backup, field-to-database, metadata
:[https://www.idigbio.org/sites/default/files/workshop-presentations/managing-nhc-data/JMc_Tempe_CMSConsiderations.pptx choosing a collection management system], data flow, data backup, field-to-database, metadata
|Amber Budden and Joanna McCaffrey
|Amber Budden and Joanna McCaffrey
|-
|-
|11:30-12:00
|11:30-12:00
|DataONE Lesson 4
|DataONE Lesson 4
:best practices for data entry and data manipulation
:[http://www.idigbio.org/sites/default/files/workshop-presentations/managing-nhc-data/TUE_1130_Session6_DataEntry_BestPractices_Budden-clean.pptx best practices for data entry and data manipulation]
|Amber Budden
|Amber Budden
|-
|-
Line 121: Line 125:
|1:00-1:30
|1:00-1:30
|Images and media issues: a brief intro
|Images and media issues: a brief intro
:choosing a camera, issues across different database platforms, image submissions, linking images to occurrence records, batch processing, dams
:[https://www.idigbio.org/sites/default/files/workshop-presentations/managing-nhc-data/JMc_Photography101_Tempe.pptx choosing a camera], issues across different database platforms, image submissions, linking images to occurrence records, batch processing, dams
|Ed Gilbert and Joanna McCaffrey
|Ed Gilbert and Joanna McCaffrey
|-
|-
|1:30-2:00
|1:30-2:00
|Digitization workflows and process
|[http://www.idigbio.org/sites/default/files/workshop-presentations/managing-nhc-data/07_CommonWorkflows.pptx Digitization workflows and process: Common Workflows and Optimization]
:getting started, prioritization, specimen collecting, new database, integrating old data
:getting started, prioritization, specimen collecting, new database, and integrating old data.
Common Workflows
:Image to data, specimen to data, to-the-web and skeletal records.
:image to data, specimen to data, to-the-web, skeletal records,
:crowd-sourcing, OCR/NLP, georeferencing, metadata
Optimization
:Reviewing your own workflow, common bottlenecks, policy, documentation  
:Reviewing your own workflow, common bottlenecks, policy, documentation  
|Katja Seltmann, Deb Paul & Ed Gilbert
|Katja Seltmann, Deb Paul & Ed Gilbert
Line 151: Line 152:
|GEOLocate Exercise (May be DEMO)
|GEOLocate Exercise (May be DEMO)
:CoGe, GPS Visualizer, re-integration, qc
:CoGe, GPS Visualizer, re-integration, qc
:Folks can preregister for GEOLocate Collaborative Georeferencing using the link below. Doing so will automatically register them for the Phoenix community project that Ed created. If you already have a login, you can use the link to register your existing account with the Phoenix project.
::http://www.museum.tulane.edu/coge/WebComEasySignUp.aspx?ajc=915E2056
|Ed Gilbert
|Ed Gilbert
|-
|-
|4:40-5:30
|4:40-5:30
|Conversation, overview of day, '''volunteers''', preview for tomorrow...
|Conversation, overview of day, preview for tomorrow, backpack logistics for tomorrow, ...
|All
|All
|-
|-
Line 164: Line 167:
|8:30-12:00
|8:30-12:00
|[http://www.dbg.org/ Desert Botanical Garden (DBG) Field Trip] and Lunch
|[http://www.dbg.org/ Desert Botanical Garden (DBG) Field Trip] and Lunch
:meet at 7:55 in Hotel Lobby, depart at 8:00 and 8:30 for DBG; garden from 9-11:30, lunch 11:30 - 12:30, depart 12:40 to ASU
:meet at 7:55 in Hotel Lobby, depart at 8:00 and 8:30 for DBG; garden from 9-11:30, lunch 11:30 - 12:30, aim to depart 12:00 and 12:30 to ASU. Bring a hat!
|  
|  
|-
|-
| style="background-color: #eee;" |12:00-1:00
| style="background-color: #eee;" |11:30-12:30
| style="background-color: #eee;" | Lunch at Gertrude's (in the Garden)
| style="background-color: #eee;" | Lunch at Gertrude's (in the Garden) YUM!
| style="background-color: #eee;" |  
| style="background-color: #eee;" |  
|-
|-
|1:00-1:25
|2:00-2:35
|Welcome Back and Intro to Data Quality
|Welcome Back and Intro to Data Quality
:inside the data-life-cycle, cost of data quality, quality vs completeness
:inside the data-life-cycle, cost of data quality, quality vs completeness
|Amber Budden, Greg Riccardi, (Ed Gilbert)
|Amber Budden, Greg Riccardi, (Ed Gilbert)
|-
|-
|1:25-1:45
|2:35-2:45
|Review Tools for Data Cleaning, Data Manipulation, and Visualization (and Lessons)
|[https://www.idigbio.org/sites/default/files/workshop-presentations/managing-nhc-data/02_ReviewTools.ppt Review Tools for Data Cleaning, Data Manipulation, and Visualization] (and Lessons)
:Spreadsheets, Kurator, GPS Visualizer, GEOLocate, CoGe, Google Maps, CartoDB, Google Fusion Tables, Notepad++, Open Refine, BioVeL, Access, (others), iDigBio recordset data cleaning, iPlant TNRS, RegEx
:Spreadsheets, Kurator, GPS Visualizer, GEOLocate, CoGe, Google Maps, CartoDB, Google Fusion Tables, Notepad++, Open Refine, BioVeL, Access, (others), iDigBio recordset data cleaning, iPlant TNRS, RegEx
:Where do they fit in your workflow?
:Where do they fit in your workflow?
|Deb Paul
|-
|-
|1:45-2:00
|2:45-2:50
|Data Cleaning
|[https://www.idigbio.org/sites/default/files/workshop-presentations/managing-nhc-data/03_DataCleaningWorkflows145PM-Wednesday.ppt Data Cleaning]
:where, when and how does it happen?, what kind of feedback to expect
:where, when and how does it happen?, what kind of feedback to expect
:types of common errors and omissions, best practices strategies, feedback and annotation, error tracking, automation, policies and protocols  
:types of common errors and omissions, best practices strategies, feedback and annotation, error tracking, automation, policies and protocols  
|Deb Paul & Katja Seltmann
|Deb Paul & Katja Seltmann
|-
|-
|2:00-2:50
|2:50-3:40
|Data Cleaning Exercise I
|Data Cleaning Exercise I
:(opt: quick exercise - spot the snafus)
:better spreadsheet skills (Data Carpentry)
:better spreadsheet skills (Data Carpentry)
|Deb Paul & Katja Seltmann
:http://idigbio.github.io/spreadsheet-skills/00-intro.html
|Katja Seltmann & Deb Paul
|-
|-
| style="background-color: #eee;" | 2:50-3:10
| style="background-color: #eee;" | 3:40-4:00
| style="background-color: #eee;" | Break
| style="background-color: #eee;" | Break
| style="background-color: #eee;" |
| style="background-color: #eee;" |
|-
|-
|3:10-3:40
|4:00 - 5:00
|Data Cleaning Exercise I
:better spreadsheet skills (Data Carpentry), continued...
|Katja Seltmann & Deb Paul
|-
|5:00-5:15
|Data Cleaning Exercise II
|Data Cleaning Exercise II
:Open Refine, part I (facets, clustering)
:Open Refine, part I (facets, clustering)
|Deb Paul & Katja Seltmann
:https://idigbio.github.io/open-refine/00-getting-started.html
|-
:https://wiki.biovel.eu/display/doc/Installing+and+running+DR+Workflow+on+Taverna+Workbench#InstallingandrunningDRWorkflowonTavernaWorkbench-InstallingGoogleRefine
|3:40 - 4:00
:http://multimedia.journalism.berkeley.edu/tutorials/google-refine-export-json/
|Feedback: iDigBio recordset data cleaning
|Deb Paul
|Kevin Love & Katja Seltmann
|-
|-
|4:00-5:00
|5:15-5:30
|Conversation, overview of day for context and questions, '''homework''' and preview for tomorrow...
|Conversation, overview of day for context and questions, '''homework''' and preview for tomorrow...
|Deb Paul & Katja Seltmann
|Deb Paul & Katja Seltmann
|-
|Evening Activity (opt)
|Insect Collecting Opportunity
:Sign Up and Details - [http://asu-entomology.wikispaces.com/Fall+2015+Collecting Wednesday night insect collecting trip to Mesquite Wash] <br/> Pictures Please!
|Host - Nico Franz
|-
|-
!colspan="3"| Course Overview - Day 3 - Thursday September 17th
!colspan="3"| Course Overview - Day 3 - Thursday September 17th
|-
|-
|8:45-9:00
|8:30-9:00
|Discussion of Material Covered so far and Overview of Day 3
|Discussion of Material Covered so far, Overview of Day 3, Set up breakout groups
|Katja Seltmann
|Katja Seltmann
|-
|-
Line 219: Line 232:
|Potential break out groups
|Potential break out groups
:Taxonomic Names issues - TNRS, ECAT
:Taxonomic Names issues - TNRS, ECAT
:GEOLocate,CoGe,
:GEOLocate, CoGe, Georeferencing Workflows, Workshops
:Data Cleaning: what is scripting? what is regex? examples in Open Refine, possibly in Symbiota  
:Data Cleaning: what is scripting? what is regex? examples in Open Refine, possibly in Symbiota  
:your own data issues / requests
:your own data issues / requests
Line 229: Line 242:
:DataONE Data Management Planning Tool  
:DataONE Data Management Planning Tool  
:What is Data Carpentry?
:What is Data Carpentry?
:Text Editors
:rAPI
|All
|All
|-
|-
Line 247: Line 262:
|-
|-
|11:20-11:40
|11:20-11:40
|Getting Your Data Published: Sending Data to iDigBio
|[https://www.idigbio.org/sites/default/files/workshop-presentations/managing-nhc-data/JMc_Tempe_DataToiDigBio.pptx Getting Your Data Published: Sending Data to iDigBio]
:from you to us, the details, the options
:from you to us, the details, the options
|Joanna McCaffrey
|Joanna McCaffrey
Line 256: Line 271:
|-
|-
|1:00-1:45
|1:00-1:45
|iDigBio Portal Exercise
|Feedback from iDigBio as part of the Data Life Cycle and an iDigBio Portal Exercise
:Using iDigBio portal to do something with data that can’t be done within a local system
:[[Media:Recordset-cleaning.pdf| iDigBio Data Management and Recordset Data Quality]]
::Ex. PhyloJive, or regex, or LifeMapper Demo
:[https://www.idigbio.org/content/improving-data-quality-idigbio-recordset-data-cleaning-method-tools-and-data-flags Webinar coming up - Improving Data Quality: iDigBio Recordset data cleaning method, tools, and data flags] October 23rd, 2015
|Katja Seltmann
:Using the iDigBio Portal and integrated research tools (PhyloJive, LifeMapper)
:https://goo.gl/gyRwx7
:http://idigbio.github.io/spreadsheet-skills/09-iDigBio-portal.html
|Kevin Love, Katja Seltmann and Deb Paul
|-
|-
|1:45-2:05
|1:45-2:05
|Copyright / Intellectual Property
|[https://www.idigbio.org/sites/default/files/workshop-presentations/managing-nhc-data/Law%26Ethics_DataDiscovery_iDigBio.pptx Copyright / Intellectual Property]
:VertNet [http://vertnet.org/resources/datalicensingguide.html Guide to Copyright and Licenses for Dataset Publication]
:VertNet [http://vertnet.org/resources/datalicensingguide.html Guide to Copyright and Licenses for Dataset Publication]
::[http://vertnet.org/resources/norms.html VertNet Norms]
:[https://www.idigbio.org/content/idigbio-terms-use-policy iDigBio Terms of Use and Citation]
|David Bloom, Jonathan Rees, Greg Riccardi
|David Bloom, Jonathan Rees, Greg Riccardi
|-
|-
Line 270: Line 290:
| style="background-color: #eee;" |
| style="background-color: #eee;" |
|-
|-
|3:20-4:30
|3:20-5:00
|Second round of break-out groups
|Second round of break-out groups
:DWC-A publishing Exercise (or DEMO): using IPT instance OR
:DWC-A publishing Exercise (or DEMO): using IPT instance
::[https://www.idigbio.org/sites/default/files/workshop-presentations/managing-nhc-data/sample-data/sampleoccurrence_dupfixed.txt Sample Dataset]
::your email and "password"
:::http://iptworkshop.idigbio.org/ (your email prefix)
:Symbiota DwC-A mapping and publishing exercise,  
:Symbiota DwC-A mapping and publishing exercise,  
:others
:others
|Edward Gilbert
|Edward Gilbert
|-
|-
|4:30-5:30
|5:00-5:30
|Closing topics
|Closing topics
:a greater network, the global landscape, next steps
:What are your next steps for moving forward?
:review Data Life Cycle we’ve walked through
:guided discussion, survey, and thanks!, ...
:guided discussion, survey, and thanks!, ...
|Katja Seltmann & Nico Franz, all
|Katja Seltmann & all
|-
|-
|}
|}
Line 292: Line 314:
**[[Media:restaurantlist.pdf|List of Restaurants]] (pdf)
**[[Media:restaurantlist.pdf|List of Restaurants]] (pdf)
*[https://www.idigbio.org/content/managing-natural-history-collections-data-global-discoverability Workshop Calendar Announcement]
*[https://www.idigbio.org/content/managing-natural-history-collections-data-global-discoverability Workshop Calendar Announcement]
*Participant List
*[https://docs.google.com/spreadsheets/d/1P-8arv0aEG5Koo3uqSywY24t_L0mUXDsQ1mik0e7q10/edit?usp=sharing Participant List]


===Adobe Connect Access===
===Adobe Connect Access===
Line 299: Line 321:


==Workshop Documents, Presentations, and Links==
==Workshop Documents, Presentations, and Links==
*Google Collaborative Notes
*[https://docs.google.com/document/d/1uVKwl7BR_G_5iIIazHHW0YLPVSgWqAftWu_F5c7WCpA/edit Google Collaborative Notes]
**These are notes with benefits.
*links to any presentations (like power points) here
*links to any presentations (like power points) here
*[http://rs.tdwg.org/dwc/terms/ Darwin Core Terms]
*[http://rs.tdwg.org/dwc/terms/ Darwin Core Terms]
Line 316: Line 339:
==Workshop Recordings==
==Workshop Recordings==
====Day 1====
====Day 1====
*8:30am-10:15am
*8:30am-10:15am http://idigbio.adobeconnect.com/p1w43drdjqp/
*10:45am-11:00am
*1:00pm-2pm http://idigbio.adobeconnect.com/p2jibddghos/
*11:15am-12pm
*3:30-5:30pm http://idigbio.adobeconnect.com/p3rgo6o79wk/
*1:00pm-2:30pm
*3:00-5:00pm


====Day 2 ====
====Day 2 ====
*1:00pm-2:30pm
*2:00pm-4pm http://idigbio.adobeconnect.com/p88qwbfsh33/
*3:00-5:00pm
*4-5:30pm http://idigbio.adobeconnect.com/p6956srj6iu/


====Day 3 ====
====Day 3 ====
*8:30am-10:15am
*10:35am-12:00pm http://idigbio.adobeconnect.com/p38bkyk8uge/
*10:45am-11:00am
*1:00pm-3:30pm http://idigbio.adobeconnect.com/p4fkezirk1j/
*11:15am-12pm
*3:30-5:00pm http://idigbio.adobeconnect.com/p8slidpxlc7/
*1:00pm-3:30pm
*3:30-5:00pm


==Resources and Links==
==Resources and Links==
Line 350: Line 369:
**For example see [http://www.lynda.com/Access-tutorials/Relational-Database-Fundamentals/145932-2.html relational database fundamentals]
**For example see [http://www.lynda.com/Access-tutorials/Relational-Database-Fundamentals/145932-2.html relational database fundamentals]
*You want to share genetic sequence data for your specimens? Are the sequences in a database like GenBank? You can use [http://rs.tdwg.org/dwc/terms/#associatedSequences dwc:associatedSequences field] to share links to the sequences and metadata about them. Note you can soon use the Material Sample Core, and share more complex genomic data using the GGBN extensions, and also use an extension to share the specimen information from which the samples were taken.
*You want to share genetic sequence data for your specimens? Are the sequences in a database like GenBank? You can use [http://rs.tdwg.org/dwc/terms/#associatedSequences dwc:associatedSequences field] to share links to the sequences and metadata about them. Note you can soon use the Material Sample Core, and share more complex genomic data using the GGBN extensions, and also use an extension to share the specimen information from which the samples were taken.
*[https://www.idigbio.org/wiki/images/2/20/Digitization_info_from_the_INHS.pdf INHS digitization system shopping list + info on setup]
*Want to learn SQL?
**http://sqlschool.modeanalytics.com
**http://www.headfirstlabs.com/books/hfsql/
*Teaching Reproducible Research best practices
**[http://reproducible-science-curriculum.github.io/2015-06-01-reproducible-science-idigbio/#schedule Reproducible Research Workshop]
*STEM - What can you do to help?
**Read! [http://www.aauw.org/research/why-so-few/ Why So Few?]


==[[Digitization Training Workshops|Digitization Training Workshops Wiki Home]]==
==[[Digitization Training Workshops|Digitization Training Workshops Wiki Home]]==