[Image: a table missing a leg]

Biodiversity and the Big Data Table

Given the scale and heterogeneity of data about species and their environments, big data and semantic applications show promise for overcoming the scalability hurdle inherent in addressing global-scale biodiversity problems. Despite some important advances, these technologies have yet to reach their full potential in the biodiversity and environmental science disciplines. Why?

A fully functional big data “table” in biodiversity science requires maturation of four “legs”:

1. high-performance computing that can host large data sets and facilitate their analysis;
2. mass digitization of standardized data;
3. development of standards and ontologies;
4. user interfaces that lower barriers for non-technical users.

All four of these “legs” must be the same length to make a functioning table. For example, a fully developed system to host data is of little use if there are no standardized, digital data to host, and the best computing system in the world will not have many users if the interface is frustrating. The effectiveness of investments in one leg can be limited by a lack of investment in another. Currently, unequal investments in the four legs have produced a lopsided table, and researchers are left telling skeptical users, “This will be a really great table one day, trust me.” (A sketch of what legs 2 and 3 look like in practice follows below.)
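As a loose, hypothetical illustration of legs 2 and 3: here is what a single digitized specimen record might look like when expressed with Darwin Core, the TDWG standard widely used for sharing biodiversity occurrence data. The record itself and the `is_shareable` helper are invented for this example; the dictionary keys are real Darwin Core terms.

```python
# A hypothetical illustration only: an invented specimen record expressed
# using real Darwin Core terms (the TDWG biodiversity data standard).
occurrence = {
    "occurrenceID": "urn:example:specimen:0001",  # invented identifier
    "basisOfRecord": "PreservedSpecimen",
    "scientificName": "Quercus alba",
    "eventDate": "1998-05-12",
    "decimalLatitude": 38.54,
    "decimalLongitude": -78.44,
    "geodeticDatum": "WGS84",
    "country": "United States",
}

# A minimal check (hypothetical helper, not part of any real library) that
# a digitized record carries the core fields data aggregators generally
# expect before it can be hosted, shared, and analyzed at scale.
REQUIRED_TERMS = {"occurrenceID", "basisOfRecord", "scientificName", "eventDate"}

def is_shareable(record: dict) -> bool:
    """Return True if the record includes the minimal set of terms."""
    return REQUIRED_TERMS.issubset(record)

print(is_shareable(occurrence))  # -> True
```

The value of the standard is in the keys, not the values: once many collections digitize their records against the same terms, the computing “leg” has something uniform to host and analyze.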

What can we do about it?

Most informaticists who work in biodiversity and environmental science know this is a problem, but are limited in how they can respond. Many of the important tasks of building the legs are not considered worthy of funding because they are not hypothesis-driven science. There aren’t many ways to fund this sort of work directly, despite its importance. Some funders realize that investments in infrastructure are worthy, but resources are still very limited.

The best way to move forward is incrementally. Instead of building each of the four “legs” sequentially, one at a time, build the table up in short iterations, lengthening each leg in concert with the others. At the end of each iteration there is a functioning “table” that can delight users on a much faster time scale. Then start on the next iteration and lengthen all the legs a bit more. This is the best way to hold users’ attention, return value on investment quickly, and manage expectations. Then, instead of saying “Trust me”, a researcher only has to say “Try it”.

How Fair is Big Data?

“Big Data” and machine learning are used in a wide variety of disciplines, from making credit and insurance decisions to driving medical research, but how accurate is this approach? If algorithms are ground-truthed using a biased population, the results of those algorithms will also be biased. Alex Lancaster has a great post on his blog at Biosystems Analytics about the potential consequences of using biased training data:

It’s often widely assumed that decisions made by algorithms are more “neutral” and “fair” than those made by humans… machine learning algorithms, specifically “classifier” systems, trained on statistically dominant populations, can sometimes lead to erroneous classifications.

Read more at Biosystems Analytics.
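To make that failure mode concrete, here is a minimal, self-contained sketch, not Lancaster’s code, using entirely synthetic data invented for this example. A classifier is trained on a sample dominated by one group; a minority group whose feature-label relationship differs is almost entirely misclassified, even though overall accuracy looks respectable.

```python
# Synthetic, hypothetical demonstration of training-set bias:
# a classifier trained mostly on a dominant group learns that group's
# pattern and systematically misclassifies an under-represented group.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)

def make_group(n, flip):
    """Generate a synthetic group. The label depends on the first
    feature; `flip=True` reverses the relationship for the minority."""
    X = rng.normal(size=(n, 2))
    y = (X[:, 0] > 0).astype(int)
    return X, (1 - y) if flip else y

# Dominant group A is 95% of the training data; minority group B is 5%.
XA, yA = make_group(950, flip=False)
XB, yB = make_group(50, flip=True)
model = LogisticRegression().fit(np.vstack([XA, XB]), np.concatenate([yA, yB]))

# Evaluate each group separately on fresh samples.
XA_test, yA_test = make_group(1000, flip=False)
XB_test, yB_test = make_group(1000, flip=True)
print(f"accuracy on dominant group A: {model.score(XA_test, yA_test):.2f}")
print(f"accuracy on minority group B: {model.score(XB_test, yB_test):.2f}")
# Aggregate accuracy looks fine; nearly every prediction for B is wrong.
```

The aggregate score hides the harm: the model is not merely imprecise for group B, it is systematically wrong, which is exactly the sense in which a “neutral” algorithm trained on a statistically dominant population produces biased decisions.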