Semantic Linking of Phenotypes and Environments


One of the fundamental goals of biology is understanding the interactions of environment and phenotype, but this is a surprisingly difficult topic to study – not because of the concepts, but because of the data. Observations about environment and phenotype occur in separate data sets, and the terms used are far too idiosyncratic for automated integration. Several biological domains, including conservation and phylogenetics, could be advanced if these two data types could be easily merged on a large scale.

I led a recent paper, published in PeerJ, which suggests that the use of ontologies to standardize and link data about phenotypes and environments can enable scientific breakthroughs by increasing the scale and flexibility of research. This paper was a product of a workshop facilitated by the Phenotype RCN and supported by the National Science Foundation. My co-authors and I give several domain-specific use cases describing how an ontology can help advance research in four biological disciplines. We then discuss the challenges to be addressed, present some proof-of-concept analyses, and discuss existing ontologies. The summary contains three suggestions for increasing interoperability between phenotype and environment data.

We hope this paper provides you with an overview of the landscape of ontologies available for integrating environmental data, and inspires you to use them in relation to your own data. For more information about ontologies and semantics, a good first read is Semantic Web for the Working Ontologist by Dean Allemang and Jim Hendler.

How Fair is Big Data?

“Big Data” and machine learning are used in a wide variety of disciplines, from making credit and insurance decisions to driving medical research, but how accurate is this approach? If algorithms are ground-truthed using a biased population, the results of those algorithms will also be biased. Alex Lancaster has a great post on his blog at Biosystems Analytics about the potential consequences of using biased training data.

It’s often widely assumed that decisions made by algorithms are more “neutral” and “fair” than those made by humans….machine learning algorithms, specifically “classifier” systems, trained on statistically dominant populations, can sometimes lead to erroneous classifications.

read more at Biosystems Analytics

Finding Dark Data

One of my recent clients was working on optimizing a hydrodynamic, particle-tracking model for predicting the fate and transport of oil droplets in the Gulf of Mexico during an accidental marine spill. To do this, the client needed a database of field observations of hydrocarbons and various oceanographic parameters to validate model output. I was hired to find these data and reformat them for inclusion in the database. I needed to find as many field observations as possible from the Gulf of Mexico from 2010 to 2011. Many relevant data sets were easily located in large repositories and in the published literature. As the database rapidly filled, I realized that I had no idea how effective I was because I did not know the total amount of data available. The goal posts were hidden.

The majority of research output in the United States, the result of billions of US dollars in taxpayer money, is nearly impossible to find (Heidorn 2008). These data are part of the “long tail” of science. A graph showing the number of data sets in order of decreasing size demonstrates the long tail (Figure 1). There are a small number of very large data sets (left side of the graph) and a large number of smaller data sets (right side of the graph). The area under the curve on the right side is much larger than on the left, meaning that even though these data sets are small, combined they are a massive body of work. Data sets on the right side of the graph are heterogeneous and distributed, which makes them much more difficult to manage and preserve than data sets on the left side. As a result, these data sets are difficult or even impossible to find. This is why these data sets are called “dark”, and it can be an expensive problem when data are lost or have to be collected again.

Figure 1: Distribution of Data Sets by Size
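The claim that many small data sets can outweigh a few large ones can be checked with a quick calculation. The sketch below is only an illustration: it assumes data set size falls off as 1/rank (a Zipf-like shape chosen for convenience, not taken from Heidorn's data), and the counts of 1,000 data sets and 10 "head" data sets are arbitrary.

```python
# Hypothetical illustration of the long tail. The 1/rank size curve and the
# counts used below are assumptions for this sketch, not measured values.

def long_tail_split(n_datasets: int, head_count: int) -> tuple[float, float]:
    """Return the combined size of the largest `head_count` data sets
    and the combined size of all remaining (tail) data sets."""
    # Rank 1 is the largest data set; size shrinks as rank grows.
    sizes = [1.0 / rank for rank in range(1, n_datasets + 1)]
    head = sum(sizes[:head_count])
    tail = sum(sizes[head_count:])
    return head, tail


head, tail = long_tail_split(n_datasets=1000, head_count=10)
total = head + tail
print(f"head share: {head / total:.0%}, tail share: {tail / total:.0%}")
```

Under these assumptions the 990 small data sets together hold more than the 10 largest ones, which is the pattern the figure describes. A different size curve would change the exact split.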

Finding relevant data, especially if the needed data are dark, can be a difficult and lengthy task. Researchers commonly discover the data they need through a combination of searching the published scientific literature, asking colleagues at scientific conferences, and searching data repositories. The data I needed were relatively new (two years old), and it was likely that many of them had not yet been deposited or included in a published study. I didn’t have time to wait for a conference to ask colleagues for data leads. Presentations and publications are developed after data are collected and analyzed, relatively late in the research workflow (Figure 2). Was there a way to discover data based on events earlier in the research workflow? After some thought, I realized that databases and lists of awards made by funding agencies were an excellent source of information about potentially relevant data sets and who was likely to have them. Not only did I have a description of the project and the researchers’ contact information, I also had an excellent way to approach the researchers: asking them specifically about the results of a funded project. Being able to ask specific questions about data from a specific project increases the likelihood of a response from the provider. Using this approach, I identified several additional researchers with relevant data, much of which was unlikely ever to be published.


Figure 2: Simplified Research Workflow Timeline
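The award-database search described above amounts to a keyword and date filter over award records. The following is a minimal sketch of that idea, not the procedure I actually used: the Award fields, the find_candidate_awards helper, and the three placeholder records are all hypothetical, and real funding agency databases will have their own export formats and field names.

```python
from dataclasses import dataclass


@dataclass
class Award:
    """Hypothetical award record; real databases use their own field names."""
    title: str
    abstract: str
    investigator: str
    start_year: int
    end_year: int


def find_candidate_awards(awards, keywords, first_year, last_year):
    """Return awards that mention any keyword and whose funding period
    overlaps the years in which the needed data were collected."""
    matches = []
    for award in awards:
        text = f"{award.title} {award.abstract}".lower()
        mentions_keyword = any(k.lower() in text for k in keywords)
        overlaps_period = (award.start_year <= last_year
                           and award.end_year >= first_year)
        if mentions_keyword and overlaps_period:
            matches.append(award)
    return matches


# Placeholder records invented for this sketch; they are not real awards.
awards = [
    Award("Placeholder award A", "Hydrocarbon sampling in the Gulf of Mexico",
          "Investigator A", 2010, 2012),
    Award("Placeholder award B", "Seagrass surveys in the Gulf of Mexico",
          "Investigator B", 2005, 2007),
    Award("Placeholder award C", "Arctic sea ice thickness",
          "Investigator C", 2010, 2011),
]

candidates = find_candidate_awards(
    awards, ["Gulf of Mexico", "hydrocarbon"], first_year=2010, last_year=2011
)
for award in candidates:
    print(award.investigator, "-", award.title)
```

Each match supplies a named project and investigator, which is what makes a specific, project-based data request possible.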

Adding funding agency award databases to my list of places to find data has helped me serve my clients in two ways. It makes it easier to find relevant dark data, which are often in the long tail of science, and it provides a context that increases the likelihood of a successful request for data.

*To learn more about dark data and the long tail of science see “Shedding light on the dark data in the long tail of science” written by P.B. Heidorn and published in 2008 (Library Trends 57(2):280-299).