Automated Ontology Building in Ecology

One of the more difficult aspects of trying to apply “big data” thinking in ecology is the massive heterogeneity of terms. I stumble over this issue every time I work on a data set for the Encyclopedia of Life. The many different ways to describe the same habitat (among other things) and the varying granularity with which people describe habitats make it very difficult for data consumers to find, for example, all the beetles that live in the desert. It’s doubly difficult to go a step further and ask for traits, such as color, of beetles that live in deserts.

As a side note, that example is very similar to some use cases I published with several colleagues about ways to combine phenotype and environment data.

Right now, we can ask Google “How much does a narwhal weigh?” and get the answer because of the fine work my EOL colleagues and I have been doing on TraitBank (go ahead, try it), but we’ve still got a way to go before we can ask “What color are beetles that live in the desert?”. We have a plan, though, and it involves semantic technology, i.e. ontologies.

Biology already has many ontologies of varying quality available for use. Most of them can be found at OBO Foundry. Not all domains of biology have good ontologies available, however; ecology, for example, has been left out. That means there is no standard, machine-readable way of expressing which organisms are autotrophs, or nocturnal, or use camouflage, etc. Including terms such as these in an ontology is one of the many necessary steps before we can ask “Which organisms are nocturnal in an alpine forest habitat?” or, if we want to get more complicated, “Is there a relationship between the phylogeny of terrestrial, nocturnal organisms and latitude or elevation?”.
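To make the term-heterogeneity problem concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the synonym table, the placeholder taxa, and the colors are made up, and a flat lookup table is only a stand-in for what a real ontology would provide.

```python
# Hypothetical synonym table: free-text habitat labels mapped to one term.
HABITAT_TERMS = {
    "desert": "desert",
    "arid desert": "desert",
    "Sonoran Desert": "desert",
    "xeric shrubland": "desert",
    "alpine forest": "alpine forest",
}

# Made-up records with placeholder taxa; not real trait data.
records = [
    {"taxon": "beetle A", "group": "beetle", "habitat": "Sonoran Desert", "color": "black"},
    {"taxon": "beetle B", "group": "beetle", "habitat": "xeric shrubland", "color": "brown"},
    {"taxon": "beetle C", "group": "beetle", "habitat": "alpine forest", "color": "green"},
    {"taxon": "moth D", "group": "moth", "habitat": "arid desert", "color": "grey"},
]


def traits_for(records, group, habitat, trait):
    """Return {taxon: trait value} for one group in one canonical habitat."""
    matches = {}
    for record in records:
        canonical = HABITAT_TERMS.get(record["habitat"])
        if record["group"] == group and canonical == habitat:
            matches[record["taxon"]] = record[trait]
    return matches


# A literal string match misses every desert beetle in this toy data set.
naive = [r["taxon"] for r in records
         if r["group"] == "beetle" and r["habitat"] == "desert"]

# Matching on the canonical term finds them.
desert_beetle_colors = traits_for(records, "beetle", "desert", "color")

print(naive)
print(desert_beetle_colors)
```

A real ontology would go further than this lookup, for example by recording that one habitat term is a narrower kind of another, so that a query for a broad term also returns records tagged with the narrower ones.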

Building an ontology is a large, never-ending, hugely complicated task. One of my clients at the University of Colorado Boulder is the ClearEarth project. The goal of this project is to repurpose NLP and ML algorithms developed for biomedicine for use in geology and biology. These algorithms can read text and automatically generate ontologies. We’ve made a lot of progress annotating domain-specific text and will have some “auto-ontologies” by this summer. Very exciting! To support this effort and make sure the ontologies resulting from this project are meshed in with existing bio-ontologies, we are hosting an “ontology-a-thon” in Boulder this summer. Please take a look and apply if you are interested in participating. We don’t have a detailed agenda just yet, but the idea is to get ontology and ecology experts in one room to curate the auto-ontology. All expenses paid, but space is limited.

sea otter

Keystone Predators and Centrality: Ecosystem as Social Network Part 2

My last post looked at a very small but well-studied rocky intertidal ecosystem, where centrality measures identified a keystone predator (Pisaster) in the network. I was worried, though, that this method would not work on a larger, more complicated system. Let’s try these same calculations on a slightly larger kelp forest ecosystem. These systems are commonly found on the west coast of continents and are characterized by the presence of large kelps. The sea otter, Enhydra lutris, is an important predator of herbivores (e.g., sea urchins) that eat macroalgae. In the absence of sea otters, sea urchin populations explode and overgraze the kelp. Will centrality measures be able to identify the sea otter as a keystone predator?

In this network, I had trophic interactions, competition interactions, and new “habitatFor” interactions, which describe a relationship between two taxa wherein one provides habitat for the other. My initial list contained 69 interactions, and the centrality measures were all pointing to the kelp as the keystone species. This is likely because the kelp provided habitat for nearly every species in the kelp forest.

This raises an interesting question regarding our definition of keystone species. Without the kelp there is no kelp forest, and that’s why the centrality measures pointed to the kelp, but the sea otter is thought of as the keystone species in this system. An important part of the definition of a keystone species is its relative abundance. A keystone species is supposed to have a disproportionate effect on the ecosystem relative to its abundance. The kelp have a very large effect, but are also very abundant. The sea otters have a large effect and are nowhere near as abundant as the kelp. That is what makes the sea otter the keystone species and not the kelp. I can’t help but think that otters being cute and cuddly while kelp are cold and slimy has something to do with it.

An algorithm that identifies kelp as a keystone species of a kelp forest is not very helpful. The kelp are more of a foundation species. How can we identify the sea otter as a keystone species even though the kelp are far more influential? The most direct strategy is to include the relative biomass of each taxon, but this is often not known and not included in databases of networks and interactions. I am going to try to find a way to make the network calculations work, but on the full network the results of the various measures are not very helpful (most point to the kelp as the most important), except for closeness vitality, which is highest for the sea otter. When I do the calculations on a network made up of only the trophic interactions, as I did with the rocky intertidal system, the sea otter comes out on top in all the centrality measures. This supports the importance of dividing the network by interaction type before analysis.
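Dividing a network by interaction type can be sketched with NetworkX as follows. The edges here are toy data that I wrote to mirror the pattern described above, so the output illustrates the mechanics only and is not a result; the helper names `subnetwork` and `top_by_degree` are hypothetical, and degree centrality stands in for whichever measures are actually of interest.

```python
import networkx as nx

# Toy edges chosen only to show the mechanics; NOT the 69-interaction list.
toy_interactions = [
    ("sea otter", "eats", "urchins"),
    ("sea otter", "eats", "abalone"),
    ("sea otter", "eats", "crabs"),
    ("urchins", "eats", "kelp"),
    ("abalone", "eats", "kelp"),
    ("urchins", "competes", "abalone"),
    ("kelp", "habitatFor", "urchins"),
    ("kelp", "habitatFor", "abalone"),
    ("kelp", "habitatFor", "crabs"),
    ("kelp", "habitatFor", "rockfish"),
    ("kelp", "habitatFor", "sea otter"),
]

# A MultiDiGraph lets two taxa share more than one kind of interaction.
full = nx.MultiDiGraph()
for source, interaction, target in toy_interactions:
    full.add_edge(source, target, interaction=interaction)


def subnetwork(graph, interaction):
    """Return a DiGraph holding only the edges of one interaction type."""
    sub = nx.DiGraph()
    sub.add_edges_from(
        (u, v)
        for u, v, data in graph.edges(data=True)
        if data["interaction"] == interaction
    )
    return sub


def top_by_degree(graph):
    """Return the node with the highest degree centrality."""
    scores = nx.degree_centrality(graph)
    return max(scores, key=scores.get)


trophic = subnetwork(full, "eats")
print("all interactions:", top_by_degree(full))
print("trophic only:", top_by_degree(trophic))
```

In this toy data the habitat edges give the kelp the most connections in the full network, while the trophic-only subnetwork puts the sea otter on top.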

Two additional issues come to mind:

  • How can I compare centrality measures across networks with different numbers of nodes, edges, and different degrees of connectivity?
  • How does the size and granularity of a network affect the results of the connectivity calculations?

The first issue is relatively straightforward. The calculation results can be normalized against the highest value; thus, the highest result for each network is always 1. When I do this normalization, the values for Pisaster and the sea otter are both 1 and thus comparable.
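A minimal sketch of that normalization, assuming the scores come as a dictionary of node to value (the form NetworkX centrality functions return). The function name `normalize_to_max` is hypothetical, and the two star graphs are stand-ins for networks of different sizes.

```python
import networkx as nx


def normalize_to_max(scores):
    """Divide every score by the highest score, so the top node gets 1.0."""
    highest = max(scores.values())
    if highest == 0:
        # Nothing to scale against; return the scores unchanged.
        return dict(scores)
    return {node: value / float(highest) for node, value in scores.items()}


# Two stand-in networks of different sizes: a hub (node 0) with 3 or 7 spokes.
small = nx.star_graph(3)
large = nx.star_graph(7)

small_scores = normalize_to_max(dict(small.degree()))
large_scores = normalize_to_max(dict(large.degree()))

# Each hub scores 1.0 after normalization, whatever the size of its network.
print(small_scores[0], large_scores[0])
```

Note that this makes the top node of every network equal to 1 by construction, so it compares nodes by how they rank within their own network and not by absolute connectivity.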

To explore the second issue, I played a few games with the interactions in the kelp forest ecosystem. In the original list of 69 interactions, I have some that are a bit repetitive:

  • Enhydra lutris, eats, Strongylocentrotus franciscanus
  • Enhydra lutris, eats, Strongylocentrotus purpuratus
  • Enhydra lutris, eats, Strongylocentrotus droebachiensis

Strongylocentrotus is a genus of sea urchin. Each species of sea urchin is listed as eating the same five species of macroalgae. So, the network has three nodes (the three Strongylocentrotus nodes) with identical edges. What happens to the results if I collapse these three identical species nodes into one genus node? The answer is not much. The kelp still has the highest connectivity in the network containing all the interactions, and the sea otter still has the highest connectivity in the network with only trophic interactions. In the end, I collapsed the urchins into one genus, but the macroalgae were grouped into annual kelp and perennial kelp. Clearly, I need to develop some guidelines for lumping nodes consistently. Considering the high degree of taxonomic change in some groups, having genus- or family-specific nodes may be more desirable than species-specific nodes. In some cases a node defined by function instead of taxonomy may be better.
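Collapsing species nodes into a genus node can be sketched as a renaming step applied to the interaction list before the graph is built. This is one possible approach under my own assumptions, not necessarily how the repo does it: the `collapse` helper is hypothetical, and "macroalga A" is a placeholder for the macroalgae, which are not named above.

```python
import networkx as nx

# Species-to-genus mapping for the three urchins listed above.
species_to_group = {
    "Strongylocentrotus franciscanus": "Strongylocentrotus",
    "Strongylocentrotus purpuratus": "Strongylocentrotus",
    "Strongylocentrotus droebachiensis": "Strongylocentrotus",
}

interactions = [
    ("Enhydra lutris", "eats", "Strongylocentrotus franciscanus"),
    ("Enhydra lutris", "eats", "Strongylocentrotus purpuratus"),
    ("Enhydra lutris", "eats", "Strongylocentrotus droebachiensis"),
    # "macroalga A" is a placeholder prey item, not a taxon from the data.
    ("Strongylocentrotus franciscanus", "eats", "macroalga A"),
    ("Strongylocentrotus purpuratus", "eats", "macroalga A"),
    ("Strongylocentrotus droebachiensis", "eats", "macroalga A"),
]


def collapse(triples, mapping):
    """Rename nodes via mapping and drop the duplicate triples that result."""
    seen = set()
    collapsed = []
    for source, interaction, target in triples:
        triple = (mapping.get(source, source), interaction,
                  mapping.get(target, target))
        if triple not in seen:
            seen.add(triple)
            collapsed.append(triple)
    return collapsed


def to_graph(triples):
    """Build a directed graph from (source, interaction, target) triples."""
    graph = nx.DiGraph()
    for source, interaction, target in triples:
        graph.add_edge(source, target, interaction=interaction)
    return graph


by_species = to_graph(interactions)
by_genus = to_graph(collapse(interactions, species_to_group))
print(by_species.number_of_nodes(), by_genus.number_of_nodes())
```

Keeping the mapping in one dictionary makes the lumping decision explicit and repeatable, and the same step would work for a functional group such as "annual kelp" in place of a genus.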

The data files for this work can be found in the github repo.

The sea otter image is CC-BY-NC from Biopix.

Keystone Predators and Centrality: Ecosystem as Social Network Part 1

A few weeks ago I announced a project that would train an algorithm to recognize important taxa in an ecosystem using the characteristics of species interactions within that ecosystem. This post documents the first bit of work I’ve done. I’ve made a github repo with data and code. I’m using Python 2.7 with NetworkX.

First I thought I would start with a simple, well-studied ecosystem, the rocky intertidal system made famous by Robert T. Paine. This system has eight taxa with 24 relationships between them. This system has a keystone predator, the starfish Pisaster. All the interactions in this network are either “eats” or “competes”. If I run a medley of calculations over this network, I notice pretty quickly that there’s no super clear way to pick out Pisaster as the keystone predator. There are also several parameters that don’t seem to be all that useful for this application. At least I got the code to run, though. I was able to make the data file, use it to make a graph, visualize the graph, and do some calculations.

I tried doing the calculations again, only I left out some of the less helpful calculations and separated out the “eats” interactions. Since Pisaster is a keystone predator, I thought examining the trophic relationships separately might be worth a try. These results were much more interesting because Pisaster has the highest value for five centrality measures. Centrality might be a way to identify keystone predators. There are others who have also had this thought (here and here), so I feel confident I am on the right track.
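The trophic-only calculation can be sketched roughly like this. It is not the code from the repo: the triples are a small toy subset that I made up, `build_graph` is a hypothetical helper, and the three measures shown are my own picks, not necessarily among the five mentioned above.

```python
import networkx as nx

# Toy (source, interaction, target) triples; NOT the data file from the repo.
interactions = [
    ("Pisaster", "eats", "mussels"),
    ("Pisaster", "eats", "barnacles"),
    ("Pisaster", "eats", "limpets"),
    ("Pisaster", "eats", "chitons"),
    ("Pisaster", "eats", "Thais"),
    ("Thais", "eats", "mussels"),
    ("Thais", "eats", "barnacles"),
    ("mussels", "competes", "barnacles"),
    ("barnacles", "competes", "mussels"),
]


def build_graph(triples, keep=None):
    """Build a DiGraph, optionally keeping only one interaction type."""
    graph = nx.DiGraph()
    for source, interaction, target in triples:
        if keep is None or interaction == keep:
            graph.add_edge(source, target, interaction=interaction)
    return graph


trophic = build_graph(interactions, keep="eats")

measures = {
    "degree": nx.degree_centrality(trophic),
    "betweenness": nx.betweenness_centrality(trophic),
    # closeness_centrality uses inward distance on directed graphs, so the
    # graph is reversed to score each predator by its distance to its prey.
    "closeness": nx.closeness_centrality(trophic.reverse()),
}

for name, scores in measures.items():
    top = max(scores, key=scores.get)
    print(name, top, round(scores[top], 3))
```

When scores are tied, `max` returns whichever node was added to the graph first, so a single "top" node should be read with care on networks this small.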

My next worries:

  1. Is this ecosystem too small to tell me anything real? I need to work on a larger, more complicated network to see if this pattern holds up.
  2. Can I use the method of separating out a specific type of interaction to identify other types of important species, such as ecosystem engineers or keystone pollinators? I need to find well-studied ecosystems that have other types of important species.

Featured image by D. Gordon E. Robertson – Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=6434467

network icon

Ecosystem as Social Network

Lately, I’ve been thinking about how interactions between organisms in an ecosystem can be represented as a graph, with nodes and edges, similar to a social network. The nodes represent an organism or group of organisms while the edges represent the relationship between them. For example, a graph representation of an African savanna ecosystem would have lion and zebra as two nodes connected with an edge representing a predator/prey relationship. I began to wonder if some of the methods for gaining insights in social networks could be applied in ecology. There is an entire field of mathematics devoted to analyzing networks. These analyses can identify things like important nodes and sub-networks. Could I use these maths to identify important species? If the answer is yes, can I then use these derived characteristics to train a learner to identify important species in a database of interactions?
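As a minimal sketch of the graph representation described above, here is the lion and zebra example using the Python NetworkX library. The edge direction (predator to prey), the `interaction` attribute name, and the extra zebra-to-grass edge are my assumptions for illustration.

```python
import networkx as nx

# Directed graph: an edge points from the acting taxon to the taxon acted upon.
savanna = nx.DiGraph()
savanna.add_edge("lion", "zebra", interaction="eats")
# Extra edge added only to make the example less trivial.
savanna.add_edge("zebra", "grass", interaction="eats")

# Each edge carries its relationship type as an attribute.
for source, target, data in savanna.edges(data=True):
    print(source, data["interaction"], target)
```

Storing the relationship type on the edge means one graph can later hold several kinds of interaction at once.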

Interesting questions. Where to start? Fortunately, there is a major species interaction database called GloBI that I can use as a data set. There are several ecosystems that have been extensively studied and have data about the relative importance of taxa that I can use as training and test data sets. I’m not the first person to think along these lines. Lundgren and Olesen, Estrada and Bodin, Gonzalez et al., Dunne et al., Steinhaeuser and Chawla, and Jordan et al. have all published studies looking at network structure of ecosystem graphs. Their work gives me some hope that this might actually work. My contribution will be designing a learner that can identify important taxa in a database of interactions not expressly designed for this purpose.

I will be doing the analysis using the Python networkx library. I would like to focus on ecosystems of different sizes, granularity, and types for training and testing. I want to capture important predation, habitat creation, and pollination interactions. I think I’ll start with rocky intertidal ecosystems and Yellowstone National Park. Both systems have been well studied. The first task will be getting the data in a usable format for analysis. Stay tuned!

How Fair is Big Data?

“Big Data” and machine learning are used in a wide variety of disciplines, from making credit and insurance decisions to driving medical research, but how accurate is this approach? If algorithms are ground-truthed using a biased population, the results of those algorithms will also be biased. Alex Lancaster has a great post on his blog at Biosystems Analytics about the potential consequences of using biased training data.

It’s often widely assumed that decisions made by algorithms are more “neutral” and “fair” than those made by humans….machine learning algorithms, specifically “classifier” systems, trained on statistically dominant populations, can sometimes lead to erroneous classifications.

read more at Biosystems Analytics