Finding Dark Data

One of my recent clients was working on optimizing a hydrodynamic, particle-tracking model for predicting the fate and transport of oil droplets in the Gulf of Mexico during an accidental marine spill. To do this, the client needed a database of field observations of hydrocarbons and various oceanographic parameters to validate model output. I was hired to find these data and reformat them for inclusion in the database. I needed to find as many field observations as possible from the Gulf of Mexico from 2010 to 2011. Many relevant data sets were easily located in large repositories and in the published literature. As the database rapidly filled, I realized that I had no idea how effective I was because I did not know the total amount of data available. The goal posts were hidden.

The majority of research output in the United States, the result of billions of US dollars in taxpayer money, is nearly impossible to find (Heidorn 2008). These data are part of the “long tail” of science. A graph of data sets ranked in order of decreasing size demonstrates the long tail (Figure 1). There are a small number of very large data sets (left side of the graph) and a large number of smaller data sets (right side of the graph). The area under the curve on the right side is much larger than the area on the left, meaning that even though these data sets are individually small, combined they are a massive body of work. Data sets on the right side of the graph are heterogeneous and distributed, which makes them much more difficult to manage and preserve than the data sets on the left side. As a result, they are difficult or even impossible to find. This is why these data sets are called “dark”, and it becomes an expensive problem when data are lost or have to be collected again.

Figure 1: Distribution of Data Sets by Size

Finding relevant data, especially if the needed data are dark, can be a difficult and lengthy task. Researchers commonly discover the data they need through a combination of searching the published scientific literature, word of mouth at scientific conferences, and searching data repositories. The data I needed were relatively new (about two years old), and it was likely that many of them had not yet been deposited in a repository or included in a published study. I didn’t have time to wait for a conference to ask colleagues for data leads. Presentations and publications are developed after data are collected and analyzed, relatively late in the research workflow (Figure 2). Was there a way to discover data based on events earlier in the research workflow? After some thought, I realized that databases and lists of awards made by funding agencies were an excellent source of information about potentially relevant data sets and who was likely to have them. Not only did I have a description of the project and the researchers’ contact information, I also had an excellent way to approach the researchers to ask for their data, i.e., by asking them specifically about the results of a funded project. I was able to identify several additional researchers holding relevant data sets, many of which were unlikely ever to be published. In addition, being able to ask specific questions about data from a specific project increases the likelihood of a response from the provider.

Figure 2: Simplified Research Workflow Timeline
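
As a rough illustration of the award-database search described above, the following Python sketch filters a list of award records by topic keywords and by overlap with the period of interest. It is only a sketch under stated assumptions: the record fields, the function name find_candidate_awards, and the sample entries are all hypothetical, and real funding agency databases have their own schemas and search tools.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Award:
    """One record from a funding agency's award list (hypothetical schema)."""
    title: str
    abstract: str
    investigator: str
    email: str
    start: date
    end: date


def find_candidate_awards(awards, keywords, window_start, window_end):
    """Return awards that mention any keyword and overlap the time window."""
    lowered = [k.lower() for k in keywords]
    matches = []
    for award in awards:
        text = f"{award.title} {award.abstract}".lower()
        mentions_topic = any(k in text for k in lowered)
        # Two date ranges overlap when each one starts before the other ends.
        overlaps_window = award.start <= window_end and award.end >= window_start
        if mentions_topic and overlaps_window:
            matches.append(award)
    return matches


# Made-up records for illustration only; not real awards or people.
sample_awards = [
    Award("Hydrocarbon sampling in the Gulf of Mexico",
          "Field measurements of oil droplets in the water column.",
          "A. Researcher", "a.researcher@example.org",
          date(2010, 6, 1), date(2011, 5, 31)),
    Award("Coral genetics in the Pacific",
          "Population structure of reef-building corals.",
          "B. Scientist", "b.scientist@example.org",
          date(2010, 1, 1), date(2012, 12, 31)),
    Award("Gulf of Mexico current profiles",
          "Moored current meter observations.",
          "C. Investigator", "c.investigator@example.org",
          date(2007, 1, 1), date(2008, 12, 31)),
]

candidates = find_candidate_awards(
    sample_awards,
    keywords=["Gulf of Mexico", "hydrocarbon"],
    window_start=date(2010, 1, 1),
    window_end=date(2011, 12, 31),
)

# Print a contact list to use when requesting data from each project.
for award in candidates:
    print(f"{award.investigator} <{award.email}>: {award.title}")
```

Matching on award dates rather than publication dates is the point of the approach: an award record exists long before any paper or repository deposit does, so a filter like this can surface projects whose data are still dark.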

Adding funding agency award databases to my list of places to find data has helped me serve my clients in two ways. It makes it easier to find relevant dark data, which often sit in the long tail of science, and it provides a context that increases the likelihood of a successful request for data.

*To learn more about dark data and the long tail of science, see “Shedding light on the dark data in the long tail of science” by P.B. Heidorn, published in 2008 (Library Trends 57(2):280-299).