Resources for using OpenRefine
Why use OpenRefine?
OpenRefine is an open-source tool for manipulating small or large datasets in numerous formats (CSV, JSON, XML, etc.). Because of its low barrier to entry with no prior programming knowledge needed, OpenRefine is an excellent tool for the improvement and maintenance of data integrity for best practices in collections management. Data transformations are reversible and repeatable, and original data are locally preserved. The learning curve for OpenRefine is moderate, with a large community of users and shared knowledge base for help. You can use the resources on this wiki page as a starting point!
Note that this wiki page is an extension of a poster: Krimmel, Erica, & Walker, Lindsay J. (2022). Using OpenRefine for natural history collections data. Society for the Preservation of Natural History Collections (SPNHC), Edinburgh, Scotland, UK. Zenodo. https://doi.org/10.5281/zenodo.6574728
When to use OpenRefine
- For quality control, e.g. to clean recent data entry prior to (or after) database ingestion, or to clean legacy data.
- For combining and manipulating existing datasets, e.g. to transform or integrate your data with external resources like those in a taxonomic authority or Wikidata.
When not to use OpenRefine
- For adding new records individually to an existing dataset, e.g. when transcribing specimen labels.
- For text-heavy one-off data entry, e.g. when typing a sentence in a notes field associated with each row.
- For projects with multiple users on separate computers.
Download OpenRefine from https://openrefine.org.
See links below for our recommended tutorials on how to use OpenRefine. OpenRefine itself maintains a more comprehensive list of externally produced tutorials here, and searching on YouTube and Vimeo will also lead to many relevant videos.
- Data Carpentry lessons: OpenRefine for Natural History Collection Data and Data Cleaning with OpenRefine for Ecologists
- Library Carpentry lesson: OpenRefine
- OpenRefine Walk-through, step-by-step orientation by Javier Otegui using natural history museum data as a subject
- Clean Your Data: Getting Started with OpenRefine, a workshop recording produced by the University of Idaho Library Digital Initiatives (2017-02-15)
- Handouts created for use during the 2019 VRA Annual Conference workshop, Clean, Transform and Enhance Your Data: Download and Install OpenRefine and Getting Started with OpenRefine
- Data Cleaning with OpenRefine, and online short seminar organized by the Harvard Library (2020-06-25)
Reconciliation in OpenRefine allows you to look up values from your dataset in an external source, such as Wikidata. When you reconcile a value, OpenRefine can use fuzzy matching to find multiple possibly options for which value in the external source yours matches with. You can then select the appropriate match and pull additional values into your dataset from the external source based on the match. Learn more about reconciliation from the OpenRefine documentation here, and see this list of additional sources offering reconciliation services through OpenRefine.
Handy GREL scripts
- To trim leading and trailing whitespace, navigate to Edit cells > Transform and use the following code:
- This functionality can also be accomplished without scripting by navigating to Edit cells > Common transforms > Trim leading and trailing whitespace.
- To collapse whitespace, e.g. a double space or a return carriage, navigate to Edit cells > Transform and use the following code:
- This functionality can also be accomplished without scripting by navigating to Edit cells > Common transforms > Collapse consecutive whitespace.
- To add the same text to every selected row in a column with existing values, navigate to Edit cells > Transform and use the following code:
"NEW-TEXT" + value
- To replace text in a column, navigate to Edit cells > Transform and use the following code:
- This functionality can also be accomplished without scripting by navigating to Edit cells > Replace.
- To concatenate values from multiple columns, from the first column navigate to Edit cells > Transform and use the following code:
value + cells["SECOND-COLUMN"].value + cells["THIRD-COLUMN"].value
- You can combine this feature with the ability to add text, e.g.
"NEW-TEXT" + value + cells["SECOND-COLUMN"].value, and this functionality can also be accomplished without scripting by navigating to Edit column > Join columns...
- To convert data in date format into a simplified text format, navigate to Edit cells > Transform and use the following code:
- To bring data from one OpenRefine project into a second OpenRefine project, from the second project navigate to Edit column > Add column based on this column and use the following code:
- To add a facet based on whether or not the values in a row match between two columns, navigate to Facet > Custom text facet and use the following code:
cells["COLUMN-1"].value == cells["COLUMN-2"].value
Join the community
There are many audiences for OpenRefine, and the best community to join is one that aligns with your usage context and skill level. The OpenRefine Google Group is maintained by OpenRefine, and most messages posted are more technical.