Resources for using OpenRefine

From iDigBio
Revision as of 20:02, 19 July 2022 by Ekrimmel (talk | contribs) (→‎Why use OpenRefine?)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to navigation Jump to search

OpenRefine logo color.png

Why use OpenRefine?

OpenRefine is an open-source tool for manipulating small or large datasets in numerous formats (CSV, JSON, XML, etc.). Because of its low barrier to entry with no prior programming knowledge needed, OpenRefine is an excellent tool for the improvement and maintenance of data integrity for best practices in collections management. Data transformations are reversible and repeatable, and original data are locally preserved. The learning curve for OpenRefine is moderate, with a large community of users and shared knowledge base for help. You can use the resources on this wiki page as a starting point!

Note that this wiki page is an extension of a poster: Krimmel, Erica, & Walker, Lindsay J. (2022). Using OpenRefine for natural history collections data. Society for the Preservation of Natural History Collections (SPNHC), Edinburgh, Scotland, UK. Zenodo. https://doi.org/10.5281/zenodo.6574728

When to use OpenRefine

  • For quality control, e.g. to clean recent data entry prior to (or after) database ingestion, or to clean legacy data.
  • For combining and manipulating existing datasets, e.g. to transform or integrate your data with external resources like those in a taxonomic authority or Wikidata.

When not to use OpenRefine

  • For adding new records individually to an existing dataset, e.g. when transcribing specimen labels.
  • For text-heavy one-off data entry, e.g. when typing a sentence in a notes field associated with each row.
  • For projects with multiple users on separate computers.

Getting started

Download OpenRefine from https://openrefine.org.

Basic tutorials

See links below for our recommended tutorials on how to use OpenRefine. OpenRefine itself maintains a more comprehensive list of externally produced tutorials here, and searching on YouTube and Vimeo will also lead to many relevant videos.

Reconciliation

Reconciliation in OpenRefine allows you to look up values from your dataset in an external source, such as Wikidata. When you reconcile a value, OpenRefine can use fuzzy matching to find multiple possibly options for which value in the external source yours matches with. You can then select the appropriate match and pull additional values into your dataset from the external source based on the match. Learn more about reconciliation from the OpenRefine documentation here, and see this list of additional sources offering reconciliation services through OpenRefine.

Scripting

OpenRefine enables a variety of options for using scripting to extend the functionally of its graphical user interface. The main scripting languages used are General Refine Expression Language (GREL), which is similar to Excel string formulas, and Python. It can also be helpful to familiarize yourself with JavaScript Object Notation (JSON), which is a common data-interchange format.

Handy GREL scripts

  • To trim leading and trailing whitespace, navigate to Edit cells > Transform and use the following code:
value.trim()
This functionality can also be accomplished without scripting by navigating to Edit cells > Common transforms > Trim leading and trailing whitespace.
  • To collapse whitespace, e.g. a double space or a return carriage, navigate to Edit cells > Transform and use the following code:
value.replace(/\s+/,' ')
This functionality can also be accomplished without scripting by navigating to Edit cells > Common transforms > Collapse consecutive whitespace.
  • To add the same text to every selected row in a column with existing values, navigate to Edit cells > Transform and use the following code:
"NEW-TEXT" + value
  • To replace text in a column, navigate to Edit cells > Transform and use the following code:
value.replace("EXISTING-VALUE","NEW-VALUE")
This functionality can also be accomplished without scripting by navigating to Edit cells > Replace.
  • To concatenate values from multiple columns, from the first column navigate to Edit cells > Transform and use the following code:
value + cells["SECOND-COLUMN"].value + cells["THIRD-COLUMN"].value
You can combine this feature with the ability to add text, e.g. "NEW-TEXT" + value + cells["SECOND-COLUMN"].value, and this functionality can also be accomplished without scripting by navigating to Edit column > Join columns...
  • To convert data in date format into a simplified text format, navigate to Edit cells > Transform and use the following code:
value.toDate().toString('YYYY-MM-dd')
  • To bring data from one OpenRefine project into a second OpenRefine project, from the second project navigate to Edit column > Add column based on this column and use the following code:
cell.cross("NAME-OF-PROJECT-1", "PROJECT-1-COLUMN-TO-MATCH-ON")[0].cells["PROJECT-1-COLUMN-TO-GET-VALUE-FROM"].value
  • To add a facet based on whether or not the values in a row match between two columns, navigate to Facet > Custom text facet and use the following code:
cells["COLUMN-1"].value[0] == cells["COLUMN-2"].value[0]

Join the community

There are many audiences for OpenRefine, and the best community to join is one that aligns with your usage context and skill level. The OpenRefine Google Group is maintained by OpenRefine, and most messages posted are more technical.