GenBank is a repository for genetic sequences sponsored by the National Institutes of Health. It holds over 100 million different sequences from organisms across the tree of life, from humans to mushrooms to amoebae. Studying genetic sequences and how similar or different they are from one species to the next can reveal a lot about the history of life on Earth. Important questions about the evolution of photosynthesis, parasitism, and disease, among others, are being studied using genetic sequences in GenBank. To build as complete a picture as possible, researchers need access to as much sequence data as is available. Yet the sequence is not the only important piece of data.

In addition to the sequence itself, depositors can submit metadata about the sequence, but this is not required. One piece of metadata that can be very important for understanding how the sequences relate to each other is the species name. For example, if we want to understand the evolution of mammals, we need more than just a collection of sequences; we need to know which sequences came from which species. Otherwise, we can’t relate what we have learned from the sequences to what we already know. If a sequence is labeled with the wrong name or with an imprecise name, an analysis can be ruined.

Mislabeled sequences in GenBank are not a new problem. Researchers know that sequence metadata can be inaccurate, but finding and correcting errors has traditionally been a manual process. To speed this along, algorithms have been developed that find mislabeled sequences and correct them by comparing sequences and their names. This method works well when the sequence and its incorrect label come from very different organisms, e.g. prokaryotes and eukaryotes. For example, if a sequence is labeled as Homo sapiens, but is very similar to several sequences labeled as Mycoplasma and very dissimilar to several sequences labeled as Homo sapiens, then the sequence is likely from a Mycoplasma contaminant of a Homo sapiens sample. Will this method work when sequences are mislabeled with names from the same taxonomic family?
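
The label-comparison idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method used by any published tool: the function name suggest_label, the thresholds, and the similarity values are all hypothetical, and the similarity scores are assumed to have been computed beforehand by a separate sequence search.

```python
from collections import defaultdict


def suggest_label(query_label, hits, min_identity=0.95, min_support=2):
    """Suggest a different label when the closest sequences consistently carry one.

    hits is a list of (label, identity) pairs for sequences similar to the query,
    with identity expressed as a fraction between 0 and 1.
    The thresholds are illustrative assumptions, not published values.
    Returns the suggested label, or None if the current label looks plausible.
    """
    by_label = defaultdict(list)
    for label, identity in hits:
        by_label[label].append(identity)

    # Best identity to any sequence that shares the query's own label (0.0 if none).
    own_best = max(by_label.get(query_label, [0.0]))

    best_label, best_score = None, 0.0
    for label, identities in by_label.items():
        if label == query_label:
            continue
        # Only count hits that are very similar to the query.
        strong = [i for i in identities if i >= min_identity]
        # Require several strong hits that also beat the query's own label.
        if len(strong) >= min_support and max(strong) > max(own_best, best_score):
            best_label, best_score = label, max(strong)
    return best_label


# Made-up similarity values mirroring the contamination example in the text.
example_hits = [
    ("Mycoplasma", 0.99),
    ("Mycoplasma", 0.98),
    ("Homo sapiens", 0.41),
    ("Homo sapiens", 0.40),
]
print(suggest_label("Homo sapiens", example_hits))
```

A check like this depends on a clear gap between the similarity to the sequence's own label and the similarity to the competing label. When both names belong to closely related organisms, that gap shrinks and the check has much less to work with.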

Finding sequences mislabeled with a name from the same genus is much more difficult than finding sequences mislabeled with a name from a different genus. Finding and correcting name errors in sequences from single-celled organisms can be even more difficult because the names themselves have not been worked out. To investigate this, I looked at sequences from the genus Gymnodinium, a group of single-celled organisms called dinoflagellates. A sequence from Karenia selliformis was mislabeled as being from Gymnodinium maguelonnense. This sequence is 99% similar to several other sequences also labeled as Karenia selliformis and was eventually relabeled with the correct name (although not until after two phylogenetic analyses were performed using the wrong name). This error was detected manually, not algorithmically, because someone noticed that the cell culture the sequence came from had been improperly named.

One way to minimize taxonomic errors in phylogenetic analyses is to identify sequences that have enough documentation that we can be sure the name label is correct. Then, we can assume that matching sequences belong to the same taxon. A good example of this is the sequence from the Gymnodinium microreticulatum type culture. The paper accompanying this sequence contains images and morphological descriptions of the cells in the type culture, and the author confirmed that this sequence came from the type culture described in the paper. This sequence is definitely correctly labeled as Gymnodinium microreticulatum. All the other sequences in GenBank that are 95% or more similar to this sequence are also labeled as Gymnodinium microreticulatum. This suggests that there is some hope of order. Potentially, these well-documented sequences can be used to correct wrong names. Unfortunately, this is a time-consuming strategy, and not all sequences and names group together so well, especially for microbial eukaryotes. For example, the sequence from the type culture of Gymnodinium trapeziforme is most similar to a sequence labeled as Gymnodinium nolleri, and several sequences with 95% similarity are labeled as Nusuttodinium poecilochroum. This suggests a more confused picture, likely due to a combination of taxonomic confusion and mislabeling.
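
The reference-based check described above could be sketched as follows. This is a minimal illustration, assuming that pairwise similarities to each well-documented reference sequence have already been computed; the function name check_against_references, the accession placeholders (REF_1, SEQ_A, and so on), and the similarity values are hypothetical, and the 95% cutoff is simply the figure used in the example above.

```python
def check_against_references(records, identity, references, threshold=0.95):
    """Report sequences whose label disagrees with a similar, verified reference.

    records:    dict mapping accession -> current name label
    identity:   dict mapping (accession, reference accession) -> fractional identity
    references: dict mapping reference accession -> verified name
    threshold:  minimum identity for a sequence to count as matching a reference
    Returns a dict mapping accession -> list of verified names it conflicts with.
    """
    conflicts = {}
    for accession, label in records.items():
        for ref_accession, ref_name in references.items():
            if accession == ref_accession:
                continue  # a reference is not compared with itself
            similar = identity.get((accession, ref_accession), 0.0) >= threshold
            if similar and label != ref_name:
                conflicts.setdefault(accession, []).append(ref_name)
    return conflicts


# Hypothetical accessions and made-up identities, for illustration only.
references = {"REF_1": "Gymnodinium microreticulatum"}
records = {
    "SEQ_A": "Gymnodinium microreticulatum",
    "SEQ_B": "Gymnodinium sp.",
    "SEQ_C": "Karenia selliformis",
}
identity = {
    ("SEQ_A", "REF_1"): 0.99,
    ("SEQ_B", "REF_1"): 0.97,
    ("SEQ_C", "REF_1"): 0.80,
}
print(check_against_references(records, identity, references))
```

The sketch only reports conflicts and does not rename anything. In a case like Gymnodinium trapeziforme, where similar sequences carry several different names, a disagreement may reflect unsettled taxonomy rather than a labeling mistake, so a person still has to make the final call.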

A comprehensive cleaning of microbial eukaryote sequence metadata in GenBank is not likely to happen any time soon. A partial solution may be adding the ability to tag sequences as “gold star” if users can access documentation to confirm the taxonomic identification. The process of confirming the appropriateness of the taxonomic name would still be time-consuming, but would only have to be done once. In this way, the confirmation of GenBank metadata can be crowdsourced and progress can be made.