Summary of the Taxon Concept discussion 13:30-14:15, Tuesday, 2018-06-05, at the Digital Data in Biodiversity Research Conference, Berkeley
Presented by Campbell Webb (Univ. Alaska, Fairbanks). Title: “Taxon concepts solve a major problem for biodiversity informatics; so why don’t we use them?”
Introduction to the problem
Delivered by Webb
- Webb’s motivation (with Steffi Ickert-Bond, UAF) for this session: current development of informatics infrastructure for a new flora of Alaska that deals explicitly with taxon concepts (TCs).
- The problem: the circumscription of individuals/specimens (via a “specimens seen” list or via character descriptions) associated with a taxonomic name may not be stable through time, as different monographers and flora writers use the name in different ways.
- This is not a new issue. E.g.: Berendsohn 1995; discussion at Lisbon TDWG in 2003 that led to the Taxon Concept Schema in 2006 (which states, “Scientific names do not unambiguously identify taxon concepts as represented according to different models of taxonomy”); Nico Franz et al. 2008, and subsequent papers; a workshop held in 2014 here in Berkeley…
- So why discuss this yet again? Answer: the issue remains little discussed (beyond a small subset of taxonomists) and sometimes unrecognized, and solutions are unimplemented in most biodiversity informatics applications. Today’s discussion is intended to be less about the issue itself and more about the reasons for the lack of adoption, and new ways to populate databases that contain TC information.
- To explain the relationship between taxon names and TCs, Webb draws a graphical representation of a changing TC over time.
- Synonym lists are not a general solution to changing circumscriptions; synonyms are imprecise (e.g., in the case of pro parte synonyms). The challenge this presents to species distribution mapping is mentioned.
- The ‘solution’ is clear: to use a Taxon Concept, i.e., name plus (sec. or sensu) a citation of the usage.
- To be most useful, TCs must be presented along with mappings between related TCs: “How does species A sec. person X relate to species A sec. person Y?” There are five basic set relationships: congruent with (==), includes (>), is included in (<), overlaps (><), is not congruent with (|). Symbols suggested by Franz et al. 2008. These relationships when used with TCs are Taxon Concept Mappings (TCM), or alignments.
- Examples of TC usage, with TCM: Monika Koperski’s Mosses of Germany (2000); Alan Weakley’s (2015) Flora of the Southeastern US; Denis Lepage’s Avibase.
- Technological solutions not hard: 1) within a relational DB, TC and TCM need only two extra tables; 2) for transport/communication among DBs we have TCS (XML), the TaxonConcept ontology and OpenBiodivO (RDF), and TCRel (DwCA).
- So, two scenarios for biodiversity informatics applications: Scenario 1 - names with synonym lists (the predominant current approach). Scenario 2 - TCs with TCM (rarely implemented).
- Questions for this discussion session:
  - Is this actually a serious issue?
  - Why is Scenario 2 seldom adopted?
  - How to increase adoption of TCs and TCM?
  - New (tech.) ways to generate TC/TCM data?
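The points above about set relationships and the "two extra tables" can be illustrated with a minimal Python sketch. It is not the schema of any existing system: the table and column names (`name`, `taxon_concept`, `concept_mapping`), the `relate` helper, and the specimen identifiers are all hypothetical. It also assumes a circumscription can be represented as a "specimens seen" set, and it reads the `|` symbol as "no specimens shared", which is this sketch's interpretation.

```python
import sqlite3


def relate(a: set, b: set) -> str:
    """Return the symbol for the set relationship between two circumscriptions.

    Assumption: each circumscription is a set of specimen identifiers.
    Reading "|" as "shares no specimens" is this sketch's interpretation.
    """
    if a == b:
        return "=="   # congruent
    if a > b:
        return ">"    # a includes b (proper superset)
    if a < b:
        return "<"    # a is included in b (proper subset)
    if a & b:
        return "><"   # partial overlap
    return "|"        # nothing in common


# Hypothetical schema: an existing names table plus the two extra tables.
SCHEMA = """
CREATE TABLE name (id INTEGER PRIMARY KEY, name_string TEXT NOT NULL);
-- Extra table 1: a taxon concept is a name plus the source using it (sec.).
CREATE TABLE taxon_concept (
    id INTEGER PRIMARY KEY,
    name_id INTEGER NOT NULL REFERENCES name(id),
    sec TEXT NOT NULL
);
-- Extra table 2: a mapping (TCM) between two taxon concepts.
CREATE TABLE concept_mapping (
    from_tc INTEGER NOT NULL REFERENCES taxon_concept(id),
    relation TEXT NOT NULL CHECK (relation IN ('==', '>', '<', '><', '|')),
    to_tc INTEGER NOT NULL REFERENCES taxon_concept(id)
);
"""


def build_demo() -> sqlite3.Connection:
    """Create an in-memory database holding one name used in two senses."""
    db = sqlite3.connect(":memory:")
    db.executescript(SCHEMA)
    db.execute("INSERT INTO name VALUES (1, 'Species A')")
    db.executemany(
        "INSERT INTO taxon_concept VALUES (?, 1, ?)",
        [(1, "Person X"), (2, "Person Y")],
    )
    # Hypothetical specimens-seen lists for each concept.
    seen = {1: {"s1", "s2", "s3"}, 2: {"s1", "s2"}}
    db.execute(
        "INSERT INTO concept_mapping VALUES (?, ?, ?)",
        (1, relate(seen[1], seen[2]), 2),
    )
    return db


if __name__ == "__main__":
    for row in build_demo().execute("SELECT * FROM concept_mapping"):
        print(row)
```

In this toy case, "Species A sec. Person X" includes "Species A sec. Person Y", so the stored relation is `>`. Real mappings are usually asserted by a taxonomist rather than computed from complete specimen lists, so the `relate` function is only a way to make the five symbols concrete.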
The following notes were made from a recording of the discussion, and from running notes made on the whiteboard. Two general comments from me (Webb) about the discussion: 1) Sorry that I did not ask for audience members to give their names as they were commenting; the comments below are generally unattributed. Many thanks to the speakers for contributing. 2) I did not moderate the discussion well: some people who wanted to speak were not asked to, and some people got a chance to speak a lot. And I spoke too much. Sincere apologies to those who felt they were not heard!
- Audience member (AM): “Since no one is adopting Scenario 2, surely people generally feel Scenario 1 is sufficient.” Webb: “But is some part of the lack of adoption due to lack of awareness of the problem?”
- Show-of-hands poll: This was not a new issue for most people in the room.
- Show-of-hands poll: Almost unanimous agreement that Scenario 1 is not sufficient.
- AM: “If all we have is Scenario 1, we need to engage in messy hacking.”
- Webb: “Why is Scenario 2 not being adopted?”
- AM: “Most local collections systems are being used to manage local specimens, and do not care about competing TCs.” [But what about the larger data aggregators?]
- Brent Mishler: “Even ‘TC+TCM’ is not sufficient (necessary but not sufficient). The reality is that we have all these names on specimens without a TC attached to them. Need a way to map a specimen to a TC.” AM: “Identifications to concepts.” Mishler: “At the meeting we held here in Berkeley, we could only come up with two ways to do this mapping: 1) using characters themselves, 2) authoritative identification (specimens-seen lists). Very hard problem.”
- Greg Riccardi (iDigBio): “We enter the name on the label into the DB - don’t even check to see if it’s the right name. TC and TCM is secondary: the main problem is fundamental identification.”
- AM: “On annotation labels seldom is a source (TC) given for the new name. Also, a lot of projects are not recording these annotations - a big extra step beyond basic digitization. Yes, the image may contain the annotations, but new annotations (after first imaging), will require re-imaging.”
- AM (Stinger Guala?): “It’s a return-on-investment issue. In BISON we take the taxon name (only) and link it into the ITIS synonymy. This captures the lumps (TC expansion) and name changes, which is 60% of the problem. This does not capture splits.”
- Brent Mishler: “It’s true that the true lumps are the easiest things to handle, but the modern trend is the other way… there’s a lot of splitting going on (due to molecular work). Perhaps this is a two-phase problem: let go the TC problems of the past, but capture TC information going forward.” [To capture TC info going forward, we do need the informatics infrastructure to be in place.]
- AM: “If we have a ‘gold standard’ for names (it could have been USDA PLANTS, for plants), it becomes easier to go forward, but the problem is that no one will agree on a standard. Also… the problem of mapping old concepts (a book from 1850) to current concepts.”
- AM: “Just identifying the names that have been ‘equal’ for 150 years would be a very important contribution.”
- Webb: “Worth pointing out that the TC/TCM problem is serious only for a minority of taxa (~12% from Weakley’s work on Southeastern plants). This may partly explain why this is not more discussed.”
- Brent Mishler: “From our workshop, the idea emerged that it would be good to tag taxa in a DB: names have not changed vs. splits/etc.”
- AM (Stinger Guala?): “So far talking about plants as a use case. In diatoms a worse problem. E.g., it was recently found (by Ling Lin?) that taxa that had been in Navicula actually belonged in 17 different orders! Higher level TCs in the diatoms have been dramatically changed. Likely to be a similar issue for the fungi.”
- Linda Hardison: “It’s a matter of scale. The flora of Oregon is good size to do comprehensive TCM. We have a golden record for a TC to which nomenclature from other floras is synonymized. Ambiguities are addressed on a specimen-by-specimen basis.”
- AM: “This is (all) new to me. It seems like a huge enough task just to database the biota of the world. Is it meaningful to worry about TC/TCM?”
- AM: “Question for the Flora of Oregon project: What is the time cost associated with this?”
- Linda Hardison: “It’s substantial. A big part of this was the database design at the beginning. You need to address ambiguities, misapplied names, and in part concepts. But very valuable: when an ambiguous name is flagged for a specimen, we don’t map it. As person-time becomes available, we disambiguate these.”
- Webb: “How about the question of how we can use technology to populate a TC database?”
- AM: “There are lots of tools available (e.g., ECAT at GBIF) for cleaning names, and especially author strings, which are major parts of the problem. I then advocate for using a GUID for the name string, as in ITIS.”
- Webb: “Opening this discussion out, what sort of ways might you individually integrate TC reasoning into your own work?”
- David Ackerly: “Here’s an ecological angle… JVS requires a ‘taxonomy according to’ statement, but these are usually to local floras, i.e., local concepts, and not fully resolved (monograph) TCs.” Brent Mishler: “Back again to the need for ecologists to make voucher specimens!”
- AM: “Curious how different Scenario 1 is from Scenario 2; can we use the former to build the latter?”
- AM: “Most taxa are rare, with no controversy about their TCs, and in most of those cases you can translate from Scenario 1 to Scenario 2 pretty easily.”
- AM: “If this is a classic 80-20 split, then just flag the 80% without controversy (saving taxonomic research time and mistakes by users).”
- Webb: “Closing thoughts?”
- AM: “What this means to me is that we need to get taxonomists back into collections identifying things. That’s the real solution, rather than always trying to clean things up.”
- AM: “But that’s unlikely to happen. Nor are we going to get an army of taxonomists mapping TCs to other TCs. So it’s not practically possible.”
- Webb: “Should we make more of an effort to let the wider world know about this problem?”
- AM: “[No…] There’s lots and lots of literature on this, but nothing has been solved.”
- AM: “[Yes…] We understand this here, but many users (ecologists) would be lost with these nuances.”
- AM: “Regardless of the final outcome, it seems this process of questioning TCs should be useful for collections management: flagging things should help reduce the time costs for managers in reassessing specimens’ taxon status.”
- AM: “(Response to Vince) We will be revising everything going forwards, so… to do that better, less ambiguously, incorporating flags, lessons learned, more accurate names and concepts, thus producing more accurate names and concepts, is incumbent on us as a community.”
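Several comments above (tagging names that have not changed versus splits, flagging the uncontroversial majority, and leaving ambiguous names unmapped until someone can disambiguate them) suggest a simple triage step when moving from Scenario 1 toward Scenario 2. The following Python sketch illustrates one possible version. The `SynonymEntry` record, the `triage` function, and the placeholder names are hypothetical. The rule used here, that pro parte synonyms and synonyms pointing to more than one accepted name need review while simple synonymy does not, is an assumption for illustration and not a method proposed in the discussion.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SynonymEntry:
    """One row of a Scenario 1 synonym list (hypothetical structure)."""
    synonym: str
    accepted: str
    pro_parte: bool = False  # True when only part of the synonym moved


def triage(entries):
    """Split accepted names into (stable, needs_review), both sorted.

    Assumption: a name needs review if it receives a pro parte synonym,
    or if one of its synonyms also points to another accepted name
    (a likely split). Everything else is flagged as stable.
    """
    needs_review = {e.accepted for e in entries if e.pro_parte}

    # Collect every accepted name each synonym points to.
    targets = {}
    for e in entries:
        targets.setdefault(e.synonym, set()).add(e.accepted)
    for accepted_names in targets.values():
        if len(accepted_names) > 1:
            needs_review |= accepted_names

    all_accepted = {e.accepted for e in entries}
    return sorted(all_accepted - needs_review), sorted(needs_review)


if __name__ == "__main__":
    # Placeholder names, not real taxa.
    demo = [
        SynonymEntry("Aus cus", "Aus bus"),
        SynonymEntry("Aus dus", "Aus eus", pro_parte=True),
        SynonymEntry("Aus dus", "Aus fus", pro_parte=True),
    ]
    stable, review = triage(demo)
    print("stable:", stable)
    print("needs review:", review)
```

Names in the stable group could be carried into a TC database with a congruent mapping, while the review group would wait for a taxonomist, in the spirit of the Flora of Oregon approach described above.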
Further Taxon Concept discussion at the wrap-up meeting on Wednesday
Notes made by Brent Mishler (Webb did not attend)
- We first distinguished the taxon name problem from the much harder taxon concept problem – amazingly, even many workers in biodiversity informatics don’t appreciate the difference. The two problems are basically orthogonal to each other.
- Taxon names are technically specified only for type specimens – a particular type bears the name given to it forever. Names are regarded as either accepted or as a synonym: the oldest type that falls within a given taxonomic hypothesis (i.e., taxon concept) is the accepted name, other types that fall in the concept are synonyms.
- Taxon concepts are scientific hypotheses about the circumscription of a given taxon. Due to rules in the current codes of nomenclature, a given name can refer to many concepts if there is scientific controversy about circumscriptions. And, different names can refer to the same concept if there is scientific controversy about which types go where.
- Thus every use of a taxon name needs a sensu statement attached to it – meaning “this name is being used in the sense of X taxonomist.” It would be relatively simple to institute such a system going forward, by requiring people to state such in publications. But we have a huge legacy issue with millions of specimens in databases with no sensu statement. Even worse are the millions of sequences in GenBank attributed to a taxon with no sensu statement and no voucher specimen cited! The literature is also full of such situations, making the interpretation of text mining difficult.
- We need to find a way to at least tentatively assign legacy data to a sensu statement. This is quite difficult because of the many ways concepts can relate to each other. Cam Webb gave a good talk at this meeting on developing schema to relate concepts. But even with such mappings, we still need to decide which specimens should be assigned to which concept. Ideas that have been raised to approach the latter problem include:
  - Date the identification was made (in hopes that there was a uniformly accepted concept at that time – certainly not always true).
  - An assessment of the diagnostic characters of the specimen (assuming the taxonomist presented such, and that they can be seen in a specimen image).
  - Using the specimens known to have been identified by the author of the concept, via annotation history, as a “gold standard” representing her/his taxon concept. This could possibly be combined with the preceding approach by using image processing to find other specimens that match the gold standard. At minimum, users wanting only correctly identified specimens (according to a particular concept) could restrict themselves to the gold standard specimens, perhaps expanding the set of specimens by including those identified by experts who are trusted to be using a particular concept (the “silver standard”).
  - Using geographic location in the many cases where taxon splits have resulted in allopatric distributions of segregates (e.g., west and east of the Sierra).
- The PhyloCode, to be published soon (https://www.ohio.edu/phylocode/), could solve a number of these issues going forward, but many of the legacy problems will remain.
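The “gold standard” and “silver standard” idea in the notes above can be made concrete with a short Python sketch. Everything here is hypothetical: the `Identification` record, the `tier` and `select_specimens` functions, and the specimen and person names. It assumes the annotation history is available as simple (specimen, name, identifier) records, which many databases may not provide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Identification:
    """One determination taken from a specimen's annotation history."""
    specimen_id: str
    name: str
    identified_by: str


def tier(ident, concept_name, concept_author, trusted_experts):
    """Classify an identification relative to one taxon concept.

    Returns "gold" if the concept's author made the identification,
    "silver" if a trusted expert did, "unassigned" if the name matches
    but the identifier is neither, and None if the name differs.
    """
    if ident.name != concept_name:
        return None
    if ident.identified_by == concept_author:
        return "gold"
    if ident.identified_by in trusted_experts:
        return "silver"
    return "unassigned"


def select_specimens(idents, concept_name, concept_author,
                     trusted_experts=frozenset(), include_silver=False):
    """Return specimen ids a user could treat as belonging to the concept."""
    wanted = {"gold", "silver"} if include_silver else {"gold"}
    return [
        i.specimen_id
        for i in idents
        if tier(i, concept_name, concept_author, trusted_experts) in wanted
    ]


if __name__ == "__main__":
    # Hypothetical annotation history.
    history = [
        Identification("spec-1", "Species A", "Person X"),
        Identification("spec-2", "Species A", "Expert Z"),
        Identification("spec-3", "Species A", "Unknown collector"),
        Identification("spec-4", "Species B", "Person X"),
    ]
    print(select_specimens(history, "Species A", "Person X"))
    print(select_specimens(history, "Species A", "Person X",
                           trusted_experts={"Expert Z"}, include_silver=True))
```

Specimens that land in the "unassigned" tier are the legacy problem described above: the name matches, but nothing links the identification to a particular sense of the name.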
References
- Berendsohn, W. G. (1995) The concept of “potential taxa” in databases. Taxon, 44: 207-212.
- Franz, N., Peet, R. K., and Weakley, A. (2008) On the use of taxonomic concepts in support of biodiversity research and taxonomy. In The New Taxonomy (edited by Q. D. Wheeler), volume 76 of Systematics Association Special Volume, pp. 61-84. Chapman & Hall, London.
- Koperski, M. (2000) Referenzliste der Moose Deutschlands: Dokumentation unterschiedlicher taxonomischer Auffassungen. Number 34 in Schriftenreihe für Vegetationskunde. Bundesamt für Naturschutz, Bonn, Germany.
- Weakley, A. S. (2015) Flora of the Southern and Mid-Atlantic States. University of North Carolina Herbarium, Chapel Hill, NC. http://www.herbarium.unc.edu/flora.htm