Citizen Science Data Integration

One of the projects I’m working on now is integrating data about North American butterfly observations collected by about a dozen different citizen science butterfly monitoring programs. The people who collect the data are volunteers who are assigned a specific route and keep track of all the butterflies they observe while walking along the route. Some observations have a latitude, a longitude, and a date. Some observations have a country, state, county, and a date. If someone wanted to find out whether the abundance or distribution of a specific species of butterfly has changed over time, they could group the data sets by butterfly name to get the data for the analysis.

Easy, right? Well, no.

Each monitoring program has its own list of species names for its volunteers to use, and every list is different. This can happen for several reasons. The simplest is that these monitoring programs are regional, so a volunteer in Colorado will not see the same butterflies as a volunteer in Louisiana. For the sake of simplicity, a monitoring program will not include species on its list that volunteers are not likely to see.

Three other scenarios make lists difficult to integrate. First, two different monitoring programs may see the same butterfly but disagree on what that butterfly is called. This actually isn’t so bad. We can say that butterfly X according to program A = butterfly Y according to program B. Problem solved.

A more difficult problem arises when one program differentiates two species that another program lumps together. This can happen either because a program does not recognize the existence of a species or subspecies that other programs do recognize, or because two species are too difficult for volunteers to tell apart. This is harder because if someone wanted to integrate data for one of the lumped species, there would be no way to separate the individual data points in a data set from a program that lumped. The user would have to either accept the lumped group and lump data from other programs in the same way, or throw out the lumped data.

The third problem is the trickiest. What if programs disagree about how to define the species themselves, not just what to call them? Let’s imagine that there are 50 individual butterflies (Fig. 1). One butterfly expert may determine that these individuals can be divided into two species. Another expert may divide them into four species, and a third may agree that there are two species but divide up the individuals differently from the first expert. Without extensive metadata about the individual that was observed, there’s very little that can be done to effectively integrate data sets with these discrepancies.

In addition, monitoring programs may not make available their criteria for labeling a specific butterfly with a specific name. One generally assumes that if two programs are using the same name, they are using it in the same way.
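To make the three scenarios concrete, here is a minimal Python sketch of how they might be told apart. It rests on a big assumption: that we already know which underlying groups of butterflies each name covers in each program. The program contents, species names, and group IDs (c1, c2, ...) are all invented for illustration and do not come from any real monitoring program.

```python
# Hypothetical data: each name in a program's list is mapped to the set of
# underlying butterfly groups it covers. All names and IDs are made up.
PROGRAM_A = {
    "Butterfly X": {"c1"},
    "Species P": {"c2"},
    "Species Q": {"c3"},
    "Species R": {"c4", "c5"},
}
PROGRAM_B = {
    "Butterfly Y": {"c1"},            # same butterfly as X, different name
    "P/Q complex": {"c2", "c3"},      # lumps Species P and Species Q
    "Species S": {"c5", "c6"},        # drawn differently from Species R
}


def relate(name, source, target):
    """Return (target_name, relationship) pairs for one source name."""
    covered = source[name]
    results = []
    for other, other_covered in target.items():
        if not covered & other_covered:
            continue  # the two names share no butterflies at all
        if covered == other_covered:
            relationship = "equivalent"
        elif covered < other_covered:
            relationship = "lumped in target"
        elif covered > other_covered:
            relationship = "split in target"
        else:
            relationship = "partial overlap"
        results.append((other, relationship))
    return results


print(relate("Butterfly X", PROGRAM_A, PROGRAM_B))
print(relate("Species P", PROGRAM_A, PROGRAM_B))
print(relate("Species R", PROGRAM_A, PROGRAM_B))
```

Only the "equivalent" case allows a clean one-to-one translation. A "lumped in target" result means data can be combined only at the level of the lumped group, and a "partial overlap" result is the third scenario, where the observations cannot be reconciled without more information about the individuals.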

Butterfly Species

Figure 1: Fifty butterflies divided into species by three experts. Expert 1 (orange boxes) thinks there are two species present. Expert 2 (green boxes) thinks there are four species present. Expert 3 (purple boxes) thinks there are two species present, but has drawn the boxes differently from Expert 1.

Back to my project. How can we manage these discrepancies between lists to make an integrated North American butterfly monitoring data set? The first step of our plan is to map each list to every other list. A user could then say “I have an observation of butterfly C by monitoring program 1. What is its equivalent in monitoring program 2?” The exact nature of the mapping will have to wait until we know exactly how many and what kinds of difficult situations we have to cope with. I will compare the lists pairwise using the newly developed Taxonomic Tree Tool. The tool was developed by an EOL Rubinstein Fellow and it produces some interesting visualizations of the differences between classifications.
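As a rough illustration of what comparing every list against every other list involves, here is a small Python sketch. It is not the Taxonomic Tree Tool, just plain set comparison on name strings, and the program names and species lists are hypothetical.

```python
from itertools import combinations

# Hypothetical species lists; real lists would be far longer.
SPECIES_LISTS = {
    "program_1": {"Butterfly A", "Butterfly B", "Butterfly C"},
    "program_2": {"Butterfly B", "Butterfly C", "Butterfly D"},
    "program_3": {"Butterfly C", "Butterfly E"},
}


def pair_count(n):
    """Number of unordered pairs among n lists."""
    return n * (n - 1) // 2


def compare_pairs(species_lists):
    """Compare every pair of lists by exact name match."""
    report = {}
    for first, second in combinations(sorted(species_lists), 2):
        names_1 = species_lists[first]
        names_2 = species_lists[second]
        report[(first, second)] = {
            "shared": names_1 & names_2,
            "only_first": names_1 - names_2,
            "only_second": names_2 - names_1,
        }
    return report


report = compare_pairs(SPECIES_LISTS)
for pair, result in report.items():
    print(pair, sorted(result["shared"]))
```

With exactly a dozen lists, `pair_count(12)` gives 66 comparisons. Exact string matching only shows where names coincide. As noted above, a shared name does not guarantee that two programs use it the same way, so the names that fail to match, and some that do, still need expert review.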

Developing a unified list will undoubtedly be a lot of difficult work. Wouldn’t it be easier if these programs agreed on and used a single list? Maybe the unified list produced by this project? It would most definitely be easier for data managers, but not necessarily for the projects and their volunteers. These species lists have been in place for years, sometimes decades, and expecting the programs to change is unrealistic. A much better long-term solution is to make our mappings available. That way no one else will have to repeat this work, and small changes can be added over time to keep the mappings up to date.

I want to find a good way not only to share the taxonomy, but also to make it easy to update as our knowledge of butterflies improves. There seems to be no ideal way to do this, but there is one okay way to do it. Ideally, there would be a web service where a volunteer from one monitoring program could submit a list of names that they are using and get back a list of the corresponding names from another monitoring program. Changing the mappings could be done online by the monitoring programs directly. Theoretically, this is already possible through a tool called GNRD. This tool is a web service that can take a list of names and return the “correct” name according to a user-picked authority. Right now the available authorities include large nomenclators and aggregators like Catalogue of Life and the Encyclopedia of Life. In our case, each butterfly monitoring program could also be an authority. If we model the reconciled groups of names in a Darwin Core Archive, GNRD can use that file to map names. At the same time, I can submit the archive to GBIF as a taxonomic data set. This plan will make the list available and usable, but does not provide a good way to update the mappings over time. Updates would have to be made by me changing the Darwin Core Archive.

Mapping species lists that are actively being used by citizen science monitoring programs will be a difficult but important task. My colleagues on this project and I will publish this work, but I want to think beyond publishing to find a way to keep this mapping useful into the future.