Trickle Down Attribution

Last week I was in Portland, Oregon attending the annual meeting of Force11, a community interested in the future of research communications. There were many great speakers and panel discussions, but what interested me the most was the unveiling of OpenVIVO. Anyone with an ORCiD can “claim” their OpenVIVO profile. I logged in using my ORCiD and my research output was instantly imported into my OpenVIVO profile. As new works were added, I was asked to claim my role in creating them. These roles went far beyond traditional authorship. I could get specific credit for data curation, graphic design, being the equipment technician, and many other roles by clicking on check boxes. All of these roles were part of the VIVO-ISF ontology that helps standardize contribution types across institutions and disciplines.

I have an OpenVIVO profile that lists publications and data sets, but my profile information doesn’t stop there. Each publication and data set has an Altmetric badge. Here is an example from one of my more widely tweeted works. The badge is the “rainbow donut” in the upper right. Clicking on the donut will take you to a summary page at Altmetric that gives more information about how people have been interacting with my publication. Altmetric creates these colored donuts using data from 15 different “sources of attention”. The number in the middle of the donut is automatically calculated as a weighted count of all the attention the research product has received. It is hard to know the true meaning of these metrics, but I’m interpreting them as a measure of immediate interest. Time may prove otherwise, but I consider research products that received more attention to be more interesting to the community. I can get this information for publications and data sets, but what about individual data points?
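The weighted count in the middle of the donut can be sketched in a few lines. The sources and weights below are hypothetical stand-ins, not Altmetric's actual (proprietary) weighting scheme:

```python
# A minimal sketch of a weighted attention score, in the spirit of the
# Altmetric donut. These weights are invented for illustration only.
ATTENTION_WEIGHTS = {
    "news": 8.0,       # hypothetical weight for news outlets
    "blog": 5.0,       # hypothetical weight for blog posts
    "tweet": 0.25,     # hypothetical weight for tweets
    "facebook": 0.25,  # hypothetical weight for Facebook posts
}

def attention_score(mentions):
    """Sum mention counts per source, scaled by that source's weight."""
    return sum(ATTENTION_WEIGHTS.get(source, 0.0) * count
               for source, count in mentions.items())

score = attention_score({"news": 1, "blog": 2, "tweet": 40})
print(round(score))  # 8 + 10 + 10 = 28
```

Whatever the real weights are, the point is the same: a single number summarizes attention from many channels at once.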

Part of the work that I do for the Encyclopedia of Life involves EOL TraitBank, a semantic database for species traits. The TraitBank data model separates individual data points so that data sets can be pulled apart and reassembled to respond to user queries. The data in TraitBank comes from many different providers. Every datum is labeled with attribution that can include a Creator, Publisher, Contributor, and a bibliographic reference. TraitBank users are asked to cite the original data provider, so that credit can be assigned to the data source. When I saw the Altmetric badge in OpenVIVO, my first thought was to apply these metrics to TraitBank data sets and add EOL as a source of attention. The data providers would have additional information about how their data are being used and EOL would have a better measure of how much value (in the form of increased attention) they were providing.
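A per-datum record along these lines might look like the sketch below. The field names and example values are illustrative, not EOL's actual schema:

```python
# A sketch of how a TraitBank-style record might carry attribution on every
# datum, so that recombined data sets still know whom to credit.
from dataclasses import dataclass

@dataclass
class TraitDatum:
    species: str
    trait: str
    value: str
    creator: str = ""
    publisher: str = ""
    contributor: str = ""
    reference: str = ""

def citations(data):
    """Collect the distinct publishers to cite for a recombined set of data."""
    return sorted({d.publisher for d in data if d.publisher})

records = [
    TraitDatum("Ursus arctos", "body mass", "217 kg",
               creator="J. Smith", publisher="Provider A"),  # made-up values
    TraitDatum("Ursus arctos", "litter size", "2",
               publisher="Provider B"),
]
print(citations(records))  # ['Provider A', 'Provider B']
```

Because attribution travels with each datum, a query result assembled from many providers can still generate its own citation list.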

As it turns out, citation and attribution get tricky when parts of thousands of data sets are recombined and analyzed to create a new data product. Most authors would rather cite a single TraitBank download as the source of all their data than cite the hundreds or thousands of smaller data sets that make up the new one. This is understandable in a printed manuscript, but it should be less of an issue in the digital age. In theory, an Altmetric donut can be applied to individual data points, data sets, and combinations of data sets. The citation of a published meta-analysis that uses millions of data points from thousands of data sets should trickle all the way down to credit the studies that produced the original data. Important data sets (or even data points) could be identified, and the provenance of meta-analyses could be improved.
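The trickling itself is just a walk down a provenance graph. Here is a minimal sketch, with made-up product identifiers and a toy graph, of how one citation of a meta-analysis could propagate credit to everything upstream:

```python
# A sketch of "trickle down attribution": one citation of a derived product
# adds credit to every upstream source it was built from.
# The identifiers and graph shape are hypothetical.
provenance = {                      # product -> the sources it draws on
    "meta-analysis": ["traitbank-download"],
    "traitbank-download": ["dataset-A", "dataset-B"],
    "dataset-A": [],
    "dataset-B": [],
}

def trickle_down(product, credit, ledger):
    """Recursively add credit to a product and all of its sources."""
    ledger[product] = ledger.get(product, 0) + credit
    for source in provenance.get(product, []):
        trickle_down(source, credit, ledger)

ledger = {}
trickle_down("meta-analysis", 1, ledger)
print(ledger)  # every upstream data set now carries 1 unit of credit
```

With a real provenance graph, a heavily reused data set would accumulate credit from every downstream product that cites anything built on it.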

The “trickle down attribution” problem has existed almost as long as there have been scientific publications. Chains of citation are too easily broken or lost for articles, and they are much harder to track for individual data points. Recreating the chains from the References Cited sections of published articles would likely result in misapplied attribution. Going forward, we can keep better track of use, but a major impediment is the lack of unique, persistent identifiers for data points and data sets. With those identifiers in place, a standard for describing a data set and its constituent parts could provide the infrastructure needed to make “trickle down attribution” a reality.
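The kind of standard imagined here could be as simple as a manifest: a data set named by a persistent identifier that also lists the identifiers of its parts. The DOI-style identifiers below are invented examples:

```python
# A sketch of a data-set manifest with persistent identifiers for the whole
# and its constituent parts. All identifiers here are made up for illustration.
manifest = {
    "id": "doi:10.9999/combined-traits.v1",   # hypothetical identifier
    "title": "Combined trait data set",
    "parts": [
        "doi:10.9999/dataset-A.v2",           # hypothetical part identifiers
        "doi:10.9999/dataset-B.v1",
    ],
}

def credit_targets(m):
    """Everything a citation of this data set should trickle down to."""
    return [m["id"], *m["parts"]]

print(credit_targets(manifest))
```

Given manifests like this, a citation tracker could resolve any cited identifier to its parts and pass the credit along automatically.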