CERF logo

Scrum Your Resource Management

Last week I attended the Coastal and Estuarine Research Federation meeting in Providence, RI, USA. CERF meets every other year and I’ve been going fairly regularly since 2001. It is a very interdisciplinary meeting attended by people from all aspects of coastal science and management. In addition to being an exhibitor, I gave a talk and a workshop, and I attended as many talks as I could.

I heard several talks about adaptive resource management, which was new to me. I was struck by how similar adaptive management is to agile programming.

Adaptive management is a way to achieve a management goal (like reduced nutrients in a wetland) by collecting data and using that data to iteratively create policy in the face of uncertainty. This contrasts with a more linear path where managers work toward a goal without making iterative adjustments. The latter method can have managers working toward a goal for much longer before taking stock of whether or not their plan is working. Below is a diagram showing the cycle of adaptive resource management. (By Conservation Measures Partnership – Open Standards for the Practice of Conservation, Public Domain, https://commons.wikimedia.org/w/index.php?curid=6925415)

adaptive management diagram

Agile programming is a method of iterative software development where products are created incrementally with iterative input from stakeholders. The data used to make adjustments at every iteration are input from the customer, performance of the product, and information about how the development team is working. This is contrasted with the waterfall method, wherein requirements and goals are decided in the beginning and developers make software with no further input from stakeholders. Below is a diagram showing the cycle of Agile software development. (From gcreddy.com)

agile development diagram

These two systems are similar in that they focus on a cyclical process of decision making based on regular inputs, which is very helpful in a dynamic world, whether it be an estuary or the marketplace. Every iteration provides new information that can be used to solve problems and make better decisions in the next iteration. Both systems acknowledge that while the end goal may not change, the most efficient path to that goal may. Stakeholder input is very important to both processes. The success or failure of each system depends on proper implementation and on due diligence at the start to agree on requirements, end goals, and standards.

Of course, both systems differ in their specific implementation because of their two very different contexts. Also, agile software development iterates over much shorter time periods than adaptive management, as short as every week, with daily check-ins called “stand ups”. Adaptive management usually iterates over an annual cycle, mostly because of the importance of seasonality to many life processes.

I think adaptive resource management is a good step toward being able to effectively care for important resources in a changing world. I enjoyed seeing old friends and colleagues at CERF and catching up on the latest developments in coastal science.

table missing leg

Biodiversity and the Big Data Table

Given the scale and heterogeneity of data about species and their environments, big data and semantic applications show promise for getting over the scalability hurdle inherent in addressing global scale biodiversity problems. Despite some important advances, these technologies have yet to reach their full potential in the biodiversity and environmental science disciplines. Why?

A fully functional big data “table” in biodiversity science requires maturation of four “legs”: 1. high-performance computing that can host large data sets and facilitate their analysis, 2. mass digitization of standardized data, 3. development of standards and ontologies, and 4. user interfaces that lower barriers for non-technical users. All four of these “legs” must be the same length to make a functioning table. For example, a fully developed system to host data will not be useful if there are no standardized, digital data to host. The best computing system in the world will not have many users if the interface is frustrating. The effectiveness of investments in one leg can be limited by a lack of investment in another. Currently, unequal investments in the four legs have resulted in a lop-sided table. Researchers are left telling skeptical users, “This will be a really great table one day, trust me.”

What can we do about it?

Most informaticists who work in biodiversity and environmental science know this is a problem, but are limited in how they can respond. Many of the important tasks of building the legs are not considered worthy of funding because they are not hypothesis-driven science. There aren’t many ways to fund this sort of work directly, despite its importance. Some funders realize that investments in infrastructure are worthy, but resources are still very limited.

The best way to move forward is incrementally. Instead of building each of the four “legs” one at a time, sequentially, build the table up in short iterations, lengthening each leg in concert with the others. At the end of each iteration, there is a functioning “table” that will delight users on a much faster time scale. Then start on the next iteration and lengthen all the legs a bit more. This is the best way to hold users’ attention, get a quick return on investment, and manage expectations. Then, instead of saying “Trust me”, a researcher just has to say “Try it”.

Automated Ontology Building in Ecology

One of the more difficult aspects of trying to apply “big data” thinking in ecology is the massive heterogeneity of terms. I stumble over this issue every time I work on a data set for the Encyclopedia of Life. The many different ways to describe the same habitat (among other things) and the varying granularity with which people describe habitats make it very difficult for data consumers to find, for example, all the beetles that live in the desert. It’s doubly difficult to go a step further and ask for traits of beetles that live in deserts, such as color.

As a side note, that example is very similar to some use cases I published with several colleagues about ways to combine phenotype and environment data.

Right now, we can ask Google “How much does a narwhal weigh?” and get the answer because of the fine work my EOL colleagues and I have been doing on TraitBank (go ahead, try it), but we’ve still got a way to go before we can ask “What color are beetles that live in the desert?”. We have a plan, though, and it involves semantic technology, i.e. ontologies.

Biology already has many ontologies of varying quality available for use. Most of them can be found at the OBO Foundry. Not all domains of biology have good ontologies available; ecology, for example, has been left out. That means there is no standard, machine-readable way of expressing which organisms are autotrophs, or nocturnal, or use camouflage, etc. Including terms such as these in an ontology is one of the many necessary steps before we can ask “Which organisms are nocturnal in an alpine forest habitat?” or, if we want to get more complicated, “Is there a relationship between the phylogeny of terrestrial, nocturnal organisms and latitude or elevation?”.
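To make the idea concrete, here is a minimal sketch of what machine-readable trait assertions could look like. The triples and term names are hypothetical; a real effort would use OWL/RDF with stable URIs (via a library like rdflib), but the query pattern is the same:

```python
# A minimal, hypothetical triple store; all taxon and trait names here
# are illustrative stand-ins, not terms from an actual ontology.
triples = {
    ("GreatHornedOwl", "hasTrait", "Nocturnal"),
    ("GiantKelp", "hasTrait", "Autotroph"),
    ("GreatHornedOwl", "livesIn", "AlpineForest"),
    ("SnowshoeHare", "hasTrait", "Nocturnal"),
    ("SnowshoeHare", "livesIn", "AlpineForest"),
}

def organisms_with(trait, habitat):
    """Answer 'which organisms have this trait in this habitat?'"""
    has_trait = {s for s, p, o in triples if p == "hasTrait" and o == trait}
    lives_in = {s for s, p, o in triples if p == "livesIn" and o == habitat}
    return has_trait & lives_in

print(sorted(organisms_with("Nocturnal", "AlpineForest")))
# ['GreatHornedOwl', 'SnowshoeHare']
```

Once terms like “Nocturnal” are standardized in an ontology, a query like this can run across data sets that never agreed on vocabulary in advance.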

Building an ontology is a large, never-ending, hugely complicated task. One of my clients at the University of Colorado, Boulder, is the ClearEarth project. The goal of this project is to repurpose NLP and ML algorithms developed for biomedicine for use in geology and biology. These algorithms can read text and automatically generate ontologies. We’ve made a lot of progress annotating domain-specific text and will have some “auto-ontologies” by this summer. Very exciting! To support this effort and make sure the ontologies resulting from this project mesh with existing bio-ontologies, we are hosting an “ontology-a-thon” in Boulder this summer. Please take a look and apply if you are interested in participating. We don’t have a detailed agenda just yet, but the idea is to get ontology and ecology experts in one room to curate the auto-ontology. All expenses paid, but space is limited.

sea otter

Keystone Predators and Centrality: Ecosystem as Social Network Part 2

My last post looked at a very small, but well studied rocky intertidal ecosystem and was able to identify a keystone predator (Pisaster) in a network using centrality measures. I was worried, though, that this method would not work on a larger, more complicated system. Let’s try these same calculations on a slightly larger kelp forest ecosystem. These systems are commonly found on the west coast of continents and are characterized by the presence of large kelps. The sea otter, Enhydra lutris, is an important predator of herbivores (e.g., sea urchins) that eat macroalgae. In the absence of sea otters, sea urchin populations explode and overgraze the kelp. Will centrality measures be able to identify the sea otter as a keystone predator? In this network, I had trophic interactions, competition interactions, and new “habitatFor” interactions that described a relationship between two taxa wherein one provided habitat for the other. My initial list contained 69 interactions and the centrality measures were all pointing to the kelp as the keystone species. This is likely because the kelp provided habitat for nearly every species in the kelp forest.

This raises an interesting question regarding our definition of keystone species. Without the kelp there is no kelp forest, and that’s why the centrality measures pointed to the kelp, but the sea otter is thought of as the keystone species in this system. An important part of the definition of a keystone species is its relative abundance. Keystone species are supposed to have a disproportionate effect on the ecosystem relative to their abundance. The kelp have a very large effect, but are also very abundant. The sea otters have a large effect and are nowhere near as abundant as the kelp. That is what makes the sea otter the keystone species and not the kelp. I can’t help but think that otters being cute and cuddly while kelp are cold and slimy has something to do with it.

An algorithm that identifies kelp as a keystone species of a kelp forest is not very helpful. The kelp are more of a foundation species. How can we identify the sea otter as a keystone species even though the kelp are far more influential? The most direct strategy is to include the relative biomass of each taxon, but this is often not known and not included in databases of networks and interactions. I am going to try to find a way to make the network calculations work, but the results of the various network measures are not very helpful (most point to the kelp as the most important) except for closeness vitality, which is highest for the sea otter. When I do the calculations on a network made up of only the trophic interactions, as I did with the rocky intertidal system, the sea otter comes out on top in all the centrality measures. This supports the importance of dividing the network by interaction type before analysis.
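As a rough illustration of splitting the network by interaction type, here is a sketch using NetworkX. The edge list is a made-up toy subset, not the actual 69-interaction data set, but it reproduces the pattern: with habitatFor edges included the kelp dominates, and on trophic edges alone the otter comes out on top.

```python
import networkx as nx

# Toy subset of a kelp-forest network; edges are illustrative only.
interactions = [
    ("killer whale", "sea otter", "eats"),
    ("sea otter", "red urchin", "eats"),
    ("sea otter", "purple urchin", "eats"),
    ("sea otter", "green urchin", "eats"),
    ("red urchin", "giant kelp", "eats"),
    ("purple urchin", "giant kelp", "eats"),
    ("green urchin", "giant kelp", "eats"),
    ("giant kelp", "red urchin", "habitatFor"),
    ("giant kelp", "purple urchin", "habitatFor"),
    ("giant kelp", "green urchin", "habitatFor"),
    ("giant kelp", "sea otter", "habitatFor"),
    ("giant kelp", "kelp crab", "habitatFor"),
]

def top_by_degree(edges):
    """Build a directed graph from an edge list and return the node with
    the highest degree centrality."""
    g = nx.DiGraph()
    g.add_edges_from(edges)
    dc = nx.degree_centrality(g)
    return max(dc, key=dc.get)

# All interaction types together: the habitatFor edges let kelp dominate
print(top_by_degree([(a, b) for a, b, t in interactions]))
# Trophic interactions only: the sea otter comes out on top
print(top_by_degree([(a, b) for a, b, t in interactions if t == "eats"]))
```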

Two additional issues come to mind:

  • How can I compare centrality measures across networks with different numbers of nodes, edges, and different degrees of connectivity?
  • How does the size and granularity of a network affect the results of the connectivity calculations?

The first issue is relatively straightforward. The calculation results can be normalized against the highest value; thus, the highest result for each network is always 1. When I do this normalization, the values for Pisaster and the sea otter are both 1 and thus comparable.
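The normalization itself is a one-liner; the centrality scores below are made up for illustration:

```python
def normalize(centrality):
    """Scale centrality scores so the highest value in each network is 1."""
    top = max(centrality.values())
    return {node: value / top for node, value in centrality.items()}

# Made-up scores for two networks of different sizes
intertidal = {"Pisaster": 0.83, "Mytilus": 0.5, "Chiton": 0.33}
kelp_forest = {"sea otter": 0.8, "giant kelp": 0.6, "red urchin": 0.5}

print(normalize(intertidal)["Pisaster"])    # 1.0
print(normalize(kelp_forest)["sea otter"])  # 1.0
```

After normalization, the top taxon in each network scores 1 regardless of how many nodes or edges the networks have, making cross-network comparison possible.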

To explore the second issue, I played a few games with the interactions in the kelp forest ecosystem. In the original list of 69 interactions, I have some that are a bit repetitive:

  • Enhydra lutris, eats, Strongylocentrotus franciscanus
  • Enhydra lutris, eats, Strongylocentrotus purpuratus
  • Enhydra lutris, eats, Strongylocentrotus droebachiensis

Strongylocentrotus is a genus of sea urchin. Each species of sea urchin is listed as eating the same five species of macroalgae. So, the network has three nodes (the three Strongylocentrotus nodes) with identical edges. What happens to the results if I collapse these three identical species nodes into one genus node? The answer is: not much. The kelp still has the highest connectivity in the network containing all the interactions and the sea otter still has the highest connectivity in the network with only trophic interactions. In the end, I collapsed the urchins into one genus node, but the macroalgae were grouped into annual kelp and perennial kelp. Clearly, I need to develop some guidelines for lumping nodes consistently. Considering the high degree of taxonomic change in some groups, having genus- or family-specific nodes may be more desirable than species-specific nodes. In some cases a node defined by function instead of taxonomy may be better.
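In NetworkX, collapsing the three congeneric urchin nodes can be done with relabel_nodes, which merges nodes that map to the same new name. The single macroalga here stands in for the five macroalgae species in the real data:

```python
import networkx as nx

g = nx.DiGraph()
g.add_edges_from([
    ("Enhydra lutris", "Strongylocentrotus franciscanus"),
    ("Enhydra lutris", "Strongylocentrotus purpuratus"),
    ("Enhydra lutris", "Strongylocentrotus droebachiensis"),
    ("Strongylocentrotus franciscanus", "Macrocystis pyrifera"),
    ("Strongylocentrotus purpuratus", "Macrocystis pyrifera"),
    ("Strongylocentrotus droebachiensis", "Macrocystis pyrifera"),
])

# Map every Strongylocentrotus species to a single genus node;
# nodes that map to the same name are merged, and duplicate edges collapse.
mapping = {n: "Strongylocentrotus" for n in g if n.startswith("Strongylocentrotus ")}
genus_graph = nx.relabel_nodes(g, mapping)

print(sorted(genus_graph.nodes()))
# ['Enhydra lutris', 'Macrocystis pyrifera', 'Strongylocentrotus']
print(genus_graph.number_of_edges())  # 2
```

The same mapping trick works for functional groupings (e.g., mapping several macroalgae to “annual kelp”), which is one way to test lumping guidelines quickly.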

The data files for this work can be found in the github repo.

The sea otter image is CC-BY-NC from Biopix.

Keystone Predators and Centrality: Ecosystem as Social Network Part 1

A few weeks ago I announced a project that would train an algorithm to recognize important taxa in an ecosystem using the characteristics of species interactions within that ecosystem. This post documents the first bit of work I’ve done. I’ve made a github repo with data and code. I’m using Python 2.7 with NetworkX.

First I thought I would start with a simple, well-studied ecosystem: the rocky intertidal system made famous by Robert T. Paine. This system has eight taxa with 24 relationships between them, and it has a keystone predator, the starfish Pisaster. All the interactions in this network are either “eats” or “competes”. If I run a medley of calculations over this network, I notice pretty quickly that there’s no super clear way to pick out Pisaster as the keystone predator. There are also several parameters that don’t seem to be all that useful for this application. At least I got the code to run, though. I was able to make the data file, use it to make a graph, visualize the graph, and do some calculations.
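That pipeline can be sketched roughly like this. The interaction rows below are a hypothetical slice for illustration, not the actual data file from the repo:

```python
import csv
import io

import networkx as nx

# Hypothetical slice of an interaction file in source,interaction,target form
data = io.StringIO(
    "source,interaction,target\n"
    "Pisaster,eats,Mytilus\n"
    "Pisaster,eats,Chiton\n"
    "Pisaster,eats,Limpet\n"
    "Mytilus,competes,Chiton\n"
)

# Build a directed graph, keeping the interaction type as an edge attribute
g = nx.DiGraph()
for row in csv.DictReader(data):
    g.add_edge(row["source"], row["target"], interaction=row["interaction"])

# A medley of centrality calculations
results = {
    "degree": nx.degree_centrality(g),
    "betweenness": nx.betweenness_centrality(g),
    "closeness": nx.closeness_centrality(g),
}
for name, scores in results.items():
    ranked = sorted(scores, key=scores.get, reverse=True)
    print(name, ranked[:3])
```

Different measures rank the taxa differently on the full network, which matches the experience described above: no single measure cleanly flags the keystone predator until the interaction types are separated.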

I tried doing the calculations again, only I left out some of the less helpful calculations and separated out the “eats” interactions. Since Pisaster is a keystone predator, I thought examining the trophic relationships separately might be worth a try. These results were much more interesting because Pisaster has the highest value for five centrality measures. Centrality might be a way to identify keystone predators. There are others who have also had this thought (here and here), so I feel confident I am on the right track.

My next worries:

  1. Is this ecosystem too small to tell me anything real? I need to work on a larger, more complicated network to see if this pattern holds up.
  2. Can I use the method of separating out a specific type of interaction to identify other types of important species, such as ecosystem engineers or keystone pollinators? I need to find well-studied ecosystems that have other types of important species.

Featured image by D. Gordon E. Robertson – Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=6434467

EarthCube in Denver

Squishy EarthCube

I was invited to give a keynote presentation at the EarthCube All-Hands Meeting in Denver last week. EarthCube is a project funded by the US National Science Foundation to build data infrastructure for geoscience. Every year they have an “all-hands meeting” for all of the people working on EarthCube projects to get together and discuss their progress. My talk was about lessons learned building community-driven cyberinfrastructure in biology, drawing on my experience working with the Encyclopedia of Life from when I first started (two months after the website first went up) to the present. I was happy to do it and I think the participants found it useful. I was most interested to see where EarthCube was after having been involved in a few of the initial planning sessions several years ago. There were several EarthCube projects that I found to be very interesting, but one in particular made me smile and was totally unexpected (for me anyway).

One of the students wrote an app called FlyoverCountry. It is meant to be installed on a mobile device and used while flying. The app links data from five sources: Macrostrat, NeotomaDB, PaleobioDB, Wikipedia, and LacCore. While a user is flying over the countryside, the app can tell them what rock formations or fossils can be found down below. With just a little imagination, one could come up with more data sources that could be linked in this app. FlyoverCountry has gotten a lot of attention and was featured on NPR’s Science Friday (July 8) and in Smithsonian magazine, among others. What really gets me excited, though, is the behind-the-scenes linking of these geoscience databases. This is the kind of linking that EarthCube should be about. I realize that EarthCube is supposed to be data infrastructure serving geoscientists, not the curious public, but an app like this can bring in the support needed to build the behind-the-scenes infrastructure the research community needs.

The EarthCube project has another five years left (at least if they stick with the original expectation of a ten-year project) and I hope I get invited to another all-hands meeting so I can see what further progress has been made.

network icon

Ecosystem as Social Network

Lately, I’ve been thinking about how interactions between organisms in an ecosystem can be represented as a graph, with nodes and edges, similar to a social network. The nodes represent an organism or group of organisms while the edges represent the relationship between them. For example, a graph representation of an African savanna ecosystem would have lion and zebra as two nodes connected with an edge representing a predator/prey relationship. I began to wonder if some of the methods for gaining insights in social networks could be applied in ecology. There is an entire field of mathematics devoted to analyzing networks. These analyses can identify things like important nodes and sub-networks. Could I use these maths to identify important species? If the answer is yes, can I then use these derived characteristics to train a learner to identify important species in a database of interactions?
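As a minimal sketch, the savanna example might look like this in NetworkX, with the relationship type stored as an edge attribute:

```python
import networkx as nx

# Sketch of an African savanna food web as a directed graph;
# edges point from consumer to resource and carry the relationship type.
savanna = nx.DiGraph()
savanna.add_edge("lion", "zebra", relationship="preysOn")
savanna.add_edge("lion", "wildebeest", relationship="preysOn")
savanna.add_edge("zebra", "grass", relationship="eats")
savanna.add_edge("wildebeest", "grass", relationship="eats")

# The predator/prey edge between lion and zebra
print(savanna["lion"]["zebra"]["relationship"])  # preysOn

# Any network measure can now be run over the ecosystem graph
print(nx.degree_centrality(savanna))
```

Once interactions are in this form, the whole toolbox of network analysis (centrality, clustering, community detection) becomes available for asking ecological questions.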

Interesting questions. Where to start? Fortunately, there is a major species interaction database called GloBI that I can use as a data set. There are several ecosystems that have been extensively studied and have data about the relative importance of taxa that I can use as training and test data sets. I’m not the first person to think along these lines. Lundren and Olesen, Estrada and Bodin, Gonzalez et al., Dunne et al., Steinhaeuser and Chawla, and Jordan et al. have all published studies looking at network structure of ecosystem graphs. Their work gives me some hope that this might actually work. My contribution will be designing a learner that can identify important taxa in a database of interactions not expressly designed for this purpose.

I will be doing the analysis using the Python networkx library. I would like to focus on ecosystems of different sizes, granularities, and types for training and testing. I want to capture important predation, habitat creation, and pollination interactions. I think I’ll start with rocky intertidal ecosystems and Yellowstone National Park. Both systems have been well studied. The first task will be getting the data into a usable format for analysis. Stay tuned!

Trickle Down Attribution

Last week I was in Portland, Oregon attending the annual meeting of Force11, a community interested in the future of research communications. There were many great speakers and panel discussions, but what interested me the most was the unveiling of OpenVIVO. Anyone with an ORCiD can “claim” their OpenVIVO profile. I logged in using my ORCiD and my research output was instantly imported into my OpenVIVO profile. As new works were added, I was asked to claim my role in creating them. These roles went far beyond traditional authorship. I could get specific credit for data curation, graphic design, being the equipment technician, and many other roles by clicking on check boxes. All of these roles were part of the VIVO-ISF ontology that helps standardize contribution types across institutions and disciplines.

I have an OpenVIVO profile that lists publications and data sets, but my profile information doesn’t stop there. Each publication and data set has an Altmetric badge. Here is an example from one of my more widely tweeted works. The badge is the “rainbow donut” in the upper right. Clicking on the donut will take you to a summary page at Altmetric that gives more information about how people have been interacting with my publication. Altmetric creates these colored donuts using data from 15 different “sources of attention”. The number in the middle of the donut is automatically calculated as a weighted count of all the attention the research product has received. It is hard to know the true meaning of these metrics, but I’m interpreting them as a measure of immediate interest. Time may prove otherwise, but I consider research products that received more attention to be more interesting to the community. I can get this information for publications and data sets, but what about individual data points?
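The weighted count behind the donut works roughly like this; the weights and mention counts below are illustrative stand-ins, not Altmetric’s actual figures:

```python
# Hypothetical per-source weights and mention counts for one publication.
# Altmetric's real scoring uses its own documented weights and adjustments.
weights = {"news": 8, "blog": 5, "twitter": 1, "facebook": 0.25}
mentions = {"news": 1, "blog": 2, "twitter": 30, "facebook": 4}

# The headline number is a weighted sum of attention across sources
score = sum(weights[source] * mentions[source] for source in mentions)
print(score)  # 49.0
```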

Part of the work that I do for the Encyclopedia of Life involves EOL TraitBank, a semantic database for species traits. The TraitBank data model separates individual data points so that data sets can be pulled apart and reassembled to respond to user queries. The data in TraitBank comes from many different providers. Every datum is labeled with attribution that can include a Creator, Publisher, Contributor, and a bibliographic reference. TraitBank users are asked to cite the original data provider, so that credit can be assigned to the data source. When I saw the Altmetric badge in OpenVIVO, my first thought was to apply these metrics to TraitBank data sets and add EOL as a source of attention. The data providers would have additional information about how their data are being used and EOL would have a better measure of how much value (in the form of increased attention) they were providing.

As it turns out, citation and attribution gets tricky when parts of thousands of data sets are recombined and analyzed to create a new data product. Most authors would rather cite one TraitBank download as the source of all their data instead of citing the hundreds or thousands of smaller data sets that make up the new data set. This is understandable in a printed manuscript, but it should be less of an issue in the digital age. In theory, an Altmetric donut can be applied to individual data points, data sets, and combinations of data sets. The citation of a published meta analysis that uses millions of data points from thousands of data sets should trickle all the way down to attribute the study that produced the original data. Important data sets (or even data points) could be identified and the provenance of meta analyses could be improved.
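The trickle-down idea can be sketched as a recursive walk over a provenance map. The identifiers here are hypothetical; a real system would use persistent identifiers like DOIs for every data set and data point:

```python
from collections import Counter

# Hypothetical provenance map: each product lists the identifiers of the
# data sets it was built from.
provenance = {
    "meta_analysis_1": ["traitbank_download_7"],
    "traitbank_download_7": ["dataset_A", "dataset_B", "dataset_C"],
}

def trickle_down(product, weight, credit=None):
    """Propagate a unit of attention from a product down to every
    constituent data set it was built from."""
    if credit is None:
        credit = Counter()
    credit[product] += weight
    for part in provenance.get(product, []):
        trickle_down(part, weight, credit)
    return credit

# One citation of the meta analysis also attributes the original data sets
print(trickle_down("meta_analysis_1", 1))
```

With identifiers and a provenance standard in place, a citation of the published meta analysis would automatically credit every original data provider in the chain.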

The “trickle down attribution” problem has existed almost as long as there have been scientific publications. Chains of citation are too easily broken or lost for articles and are much harder to track for individual data points. Recreating the chains using References Cited sections of published articles would likely result in misapplied attribution. Going into the future we can keep better track of use, but a major impediment is the lack of unique and persistent identifiers for data points and data sets. Assuming we had these identifiers in place, a standard for describing a data set and its constituent parts could provide the infrastructure needed to make “trickle down attribution” a reality.

Data Rescue in Tokyo

Two weeks ago I attended the 7th Plenary of the Research Data Alliance (RDA) in Tokyo. I enjoy attending RDA Plenaries not only because of the interesting topics but also because of all my wonderful RDA colleagues. I was recently appointed the RDA US Data Share Ambassador, so I work with the RDA US Data Share Fellows and promote their work on social media. In addition, I co-chaired a Birds-of-a-Feather meeting and attended meetings of the many Interest Groups in which I am involved. One of these is the Data Rescue Interest Group, which focuses on rescuing old data sets from the dust bin of history, usually by migrating old data to a new, usable format. Their meetings usually consist of stories of successful data rescue, failed data rescue, and a discussion of how we can learn from these examples and stop losing data. Several success stories originate from the International Data Rescue Award in the Geosciences, which recognizes efforts within the geosciences to advance preservation of and access to research data. The latest winners created an electronic catalogue of all the fossil collections in the UK. In addition, the Interdisciplinary Earth Data Alliance (IEDA) offers data rescue mini-awards that provide funds for the transfer of unpublished data to IEDA.

Personally, I enjoy hearing stories about finding and transforming data. Professionally, I know how frustrating it can be to use an old data set. That is why I think awards to recognize data rescue and funds to support data rescue are important for many scientific disciplines, not just Geoscience. There should be similar awards and funds available for the rescue of biological data sets, specifically in evolution and ecology. This idea is still very new. I’m hoping to find partners over the next year so that in 2018, there will be a winner for the International Data Rescue Award in Evolution and Ecology.

Citizen Science Data Integration

One of the projects I’m working on now is integrating data about North American butterfly observations collected by about a dozen different citizen science butterfly monitoring programs. The people who collect the data are volunteers who are assigned a specific route and keep track of all the butterflies they observe while walking along the route. Some observations have a latitude, a longitude, and a date. Some observations have a country, state, county, and a date. If someone wanted to find out if the abundance or distribution of a specific species of butterfly has changed over time, he/she could group the data sets by butterfly name to get the data for the analysis. 

Easy, right? Well, no.

Each monitoring program has its own list of species names for its volunteers to use, and every list is different. This can happen for several reasons. The simplest is that these monitoring programs are regional, so a volunteer in Colorado will not see the same butterflies as a volunteer in Louisiana. For the sake of simplicity, a monitoring program will not have species on its list that volunteers are not likely to see. Three other scenarios make lists difficult to integrate.

First, two different monitoring programs may see the same butterfly, but disagree on what that butterfly is called. This actually isn’t so bad. We can say that butterfly X according to program A = butterfly Y according to program B. Problem solved.

A more difficult problem is when one program differentiates two species that another program lumps together. This can happen either because a program may not recognize the existence of a species or subspecies that other programs do recognize, or because two species might be too difficult for volunteers to tell apart. This is harder because if someone wanted to integrate data for one of the lumped species, there would be no way to differentiate individual data points in a data set from a program that lumped. The user would have to accept the lumped group and lump data from other programs in the same way, or throw out the lumped data.

The third problem is the trickiest. What if programs disagree about how to define the species themselves, not just what to call them? Let’s imagine that there are 50 individual butterflies (Fig. 1). One butterfly expert may determine that these individuals can be divided into two species. Another expert may divide them into four species, and a third may agree that there are two species, but divide up the individuals differently from the first expert. Without extensive metadata about the individual that was observed, there’s very little that can be done to effectively integrate data sets with these discrepancies. In addition, monitoring programs may not make their criteria for labeling a specific butterfly with a specific name available. One generally assumes that if two programs are using the same name, they are using it in the same way.

Butterfly Species

Figure 1: Fifty butterflies divided into species by three experts. Expert 1 (orange boxes) thinks there are two species present. Expert 2 (green boxes) thinks there are four species present. Expert 3 (purple boxes) thinks there are two species present, but has drawn the boxes differently than Expert 1.

Back to my project. How can we manage these discrepancies between lists to make an integrated North American butterfly monitoring data set? The first step of our plan is to map each list to every other list. A user could then say “I have an observation of butterfly C by monitoring program 1. What is its equivalent in monitoring program 2?”. The exact nature of the mapping will have to wait until we know exactly how many and what kinds of difficult situations we have to cope with. I will make a pairwise comparison of each list using the newly developed Taxonomic Tree Tool. The tool was developed by an EOL Rubinstein Fellow and it produces some interesting visualizations of the differences between classifications.
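A sketch of what such a mapping lookup could look like, with hypothetical butterfly names. Note the values are lists, because one program may split what another lumps:

```python
# Hypothetical pairwise name mapping between two monitoring programs.
# A one-element list is a simple synonym; a multi-element list means
# program 2 splits a species that program 1 lumps.
program1_to_program2 = {
    "butterfly A": ["butterfly X"],
    "butterfly B": ["butterfly Y", "butterfly Z"],
}

def equivalents(name, mapping):
    """What is this program-1 name's equivalent in program 2?"""
    return mapping.get(name, [])

print(equivalents("butterfly A", program1_to_program2))  # ['butterfly X']
print(equivalents("butterfly B", program1_to_program2))  # ['butterfly Y', 'butterfly Z']
```

A multi-element result tells the user up front that they must either lump their own data the same way or set those records aside.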

Developing a unified list will undoubtedly be a lot of difficult work. Wouldn’t it be easier if these programs agreed on and used a single list? Maybe the unified list produced by this project? It would most definitely be easier for data managers, but not necessarily for the programs and their volunteers. These species lists have been in place for years, sometimes decades, and expecting the programs to change is unrealistic. A much better, long-term solution is to make our mappings available. That way no one else will have to repeat this work, and small changes can be added over time to keep the mappings up-to-date.

I want to find a good way to not only share the taxonomy, but to make it easy to update as our knowledge of butterflies improves. There seems to be no ideal way to do this, but there is one okay way. Ideally, there would be a web service where a volunteer from one monitoring program could submit a list of names that they are using and get back a list of the corresponding names from another monitoring program. Changes to the mappings could be made online by the monitoring programs directly. Theoretically, this is already possible through a tool called GNRD. This tool is a web service that can take a list of names and return the “correct” name according to a user-picked authority. Right now the available authorities include large nomenclators and aggregators like the Catalogue of Life and the Encyclopedia of Life. In our case, each butterfly monitoring program could also be an authority. If we model the reconciled groups of names in a Darwin Core Archive, GNRD can use that file to map names. At the same time, I can submit the archive to GBIF as a taxonomic data set. This plan will make the list available and usable, but does not provide a good way to update the mappings over time. Updates would have to be made by me changing the Darwin Core Archive.

Mapping species lists that are actively being used by citizen science monitoring programs will be a difficult, but important task. My colleagues on this project and I will publish this work, but I want to think beyond publishing to find a way to keep this mapping useful into the future.