Can machine learning be used for accurate species identification of beetles and other invertebrates? Dr. Katie Marshall and Jarrett Blair at the University of British Columbia (UBC) sought to answer this question using carabid beetle data from the NEON program. Eventually, they hope to leverage machine learning to identify other species caught in the NEON beetle pitfall traps. Machine learning could one day be used to classify unidentified species in the NEON bycatch (species caught other than the target species) and answer new questions about invertebrate diversity and abundance across North America.
Classifying Carabids at NEON Field Sites
Carabids are a large family of insects commonly known as ground beetles. With more than 2,000 known species in North America and 40,000 worldwide, Carabidae is one of the most species-rich taxa of invertebrate animals. They are found in nearly every ecosystem in North America. Their diversity and abundance make them a valuable taxon to study: shifts in carabid species diversity and range can provide important information about how ecosystems are changing.
That's why the NEON program collects and classifies carabids as part of the terrestrial observational sampling system. Ground beetles are collected in pitfall traps filled with a colorless and unscented preservative solution. A variety of other insects and arthropods also end up in the traps as bycatch. Carabids are identified to the species level wherever possible. Bycatch is not identified or classified but is preserved in the NEON Biorepository for interested researchers.
Dr. Katie Marshall, an assistant professor in the Department of Zoology at UBC, and Jarrett Blair, a Zoology Ph.D. candidate at UBC, are among those interested researchers. While their ultimate interest is in the non-carabid species collected in the bycatch, they first needed to test their machine learning algorithm using known species. Because the carabids collected by the NEON program and available from the NEON Biorepository were already identified to the species level, they provided an excellent opportunity to test the algorithm.
Their results were published in Ecology and Evolution in November 2020 under the title "Robust and simplified machine learning identification of pitfall trap-collected beetles at the continental scale." The study was funded through a National Science Foundation (NSF) grant.
Putting Machine Learning to the Test with Carabids
Blair and Marshall worked with collaborators from the University of Oklahoma (OU) to develop the machine learning algorithm for carabid identification. Machine learning is a form of artificial intelligence that looks for patterns in large datasets. The algorithm is fed training data—in this case, image data for carabids that had already been identified down to the species or subspecies level. By looking at large datasets, the algorithm learns which features in the data (such as body morphology and color patterns) are associated with different species.
"Once the program is trained," Blair explains, "the idea is that when you feed it new data, it will be able to make the identification fairly accurately. The goal is to classify beetle species with an accuracy that mirrors or exceeds that of humans."
For this project, they fed the algorithm extracted image data (such as specimen size, shape, and color) rather than raw pixel data. Dr. Michael Weiser at OU photographed carabid specimens from the NEON Biorepository and provided the extracted image data for the training set. An open-source software program was used to automate feature extraction from the raw images.
"The advantage of using extracted morphology data rather than raw images is that it provided important information such as body size," Marshall explains. "It also allows us to control the data that the program is learning from. If we just feed the model raw images, we don't really know what data it is extracting from those images to learn from." Using extracted image data also simplifies the dataset, allowing the program to run with much less processing power. "One of the goals was to create a model that would be accessible for other researchers. With this, you don't need a supercomputer to run the model—you can do it on your laptop."
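To make the idea of learning from extracted features concrete, here is a minimal sketch in Python. It is not the algorithm used in the study: it swaps in a simple nearest-centroid classifier, and the feature set (body length, body width, mean hue), the measurements, the labels, and the function names are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from math import dist

# Hypothetical extracted features per specimen:
# (body length in mm, body width in mm, mean hue in degrees).
# All numbers and labels are invented for illustration, not NEON data.
TRAINING = [
    ((12.1, 4.8, 30.0), "species_a"),
    ((11.7, 4.6, 28.0), "species_a"),
    ((6.2, 2.4, 210.0), "species_b"),
    ((6.5, 2.6, 205.0), "species_b"),
]


def column_stats(rows):
    """Return the mean and standard deviation of each feature column."""
    columns = list(zip(*rows))
    means = [sum(col) / len(col) for col in columns]
    stds = [
        # Fall back to 1.0 so a constant column never causes division by zero.
        (sum((x - m) ** 2 for x in col) / len(col)) ** 0.5 or 1.0
        for col, m in zip(columns, means)
    ]
    return means, stds


def scale(row, means, stds):
    """Standardize one feature row so no single unit dominates the distance."""
    return tuple((x - m) / s for x, m, s in zip(row, means, stds))


def fit_centroids(training):
    """Compute one average (centroid) feature vector per labeled species."""
    means, stds = column_stats([features for features, _ in training])
    grouped = defaultdict(list)
    for features, label in training:
        grouped[label].append(scale(features, means, stds))
    centroids = {
        label: tuple(sum(col) / len(col) for col in zip(*rows))
        for label, rows in grouped.items()
    }
    return centroids, means, stds


def predict(features, centroids, means, stds):
    """Assign a new specimen to the species with the nearest centroid."""
    point = scale(features, means, stds)
    return min(centroids, key=lambda label: dist(point, centroids[label]))


centroids, means, stds = fit_centroids(TRAINING)
print(predict((12.0, 4.7, 31.0), centroids, means, stds))  # species_a
```

Because each specimen is reduced to a handful of numbers, a model like this trains and predicts in well under a second on a laptop, which is the practical benefit Marshall describes.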
Algorithm development was led by Blair, expanding on earlier work. In 2017, while still an undergraduate, he co-founded a startup with a friend to build apps that use machine vision and machine learning for insect identification. Specifically, they were developing a program to help farmers identify pest insects on crops. "This project brings together my interests in entomology and machine learning," he says.
The program classifies specimens hierarchically, working through the taxonomic ranks. If it runs across a specimen it cannot identify to species (for example, a rare species that was not present in its original training set), it will attempt to classify it to a higher taxonomic level, such as genus or subfamily. Human taxonomists can then complete the classification and feed the species data back to the algorithm, so it can continue learning over time. The algorithm can classify hundreds of unknown individuals nearly instantaneously, greatly expanding the ability of ecologists to make use of currently unclassified preserved specimens.
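One way to picture the fallback to a higher taxonomic level is sketched below. This is an illustration under stated assumptions, not the study's implementation: it assumes a classifier that outputs a probability per species, sums those probabilities up to genus and subfamily, and reports the finest rank that clears a confidence threshold. The taxonomy table, the placeholder names, the 0.8 threshold, and the function name are all hypothetical.

```python
# Hypothetical taxonomy: species -> (genus, subfamily). Placeholder names only.
TAXONOMY = {
    "genus1 sp_a": ("genus1", "subfamily1"),
    "genus1 sp_b": ("genus1", "subfamily1"),
    "genus2 sp_c": ("genus2", "subfamily1"),
    "genus3 sp_d": ("genus3", "subfamily2"),
}


def classify_with_fallback(species_probs, taxonomy=TAXONOMY, threshold=0.8):
    """Return (rank, name) for the finest rank whose best guess meets the threshold.

    The threshold is an assumed value for illustration, not one from the study.
    """
    genus_probs, subfamily_probs = {}, {}
    for species, p in species_probs.items():
        genus, subfamily = taxonomy[species]
        # Probability mass for a higher rank is the sum over its member species.
        genus_probs[genus] = genus_probs.get(genus, 0.0) + p
        subfamily_probs[subfamily] = subfamily_probs.get(subfamily, 0.0) + p

    ranks = (
        ("species", species_probs),
        ("genus", genus_probs),
        ("subfamily", subfamily_probs),
    )
    for rank, probs in ranks:
        name, p = max(probs.items(), key=lambda item: item[1])
        if p >= threshold:
            return rank, name
    # Nothing is confident enough: leave the specimen for a human taxonomist.
    return "unresolved", None


confident = {"genus1 sp_a": 0.90, "genus1 sp_b": 0.05,
             "genus2 sp_c": 0.03, "genus3 sp_d": 0.02}
ambiguous = {"genus1 sp_a": 0.45, "genus1 sp_b": 0.45,
             "genus2 sp_c": 0.06, "genus3 sp_d": 0.04}

print(classify_with_fallback(confident))  # ('species', 'genus1 sp_a')
print(classify_with_fallback(ambiguous))  # ('genus', 'genus1')
```

In the second case the model cannot separate two congeners, so the specimen is reported at genus level and could then be passed to a taxonomist for the final call.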
In tests using NEON carabid data, the best-performing algorithm reached a species identification accuracy of ~85% when presented with unidentified image data. Providing location data improved accuracy to more than 95% at the species level and ~99% at the subfamily level. Human field technicians are expected to exceed 80% accuracy in species identification for their site locations. Blair says, "Eventually, we could turn this into a classification pipeline for the NEON program. Right now, the bottleneck is at the imaging and data extraction stage. But once that is done, it would take a fraction of a second to identify thousands of specimens."
Beyond Beetles: Looking at the Bycatch
The study serves as a proof of concept for using machine learning for invertebrate identification. Marshall and Blair next plan to expand their studies to the NEON program bycatch. The bycatch contains many additional invertebrate species of interest, including ants, spiders, and other arthropods, both common and rare. A better understanding of invertebrate populations will provide insights into ecosystems as a whole.
Marshall explains, "Invertebrates are the unsung heroes of an ecosystem. They may not be flashy or exciting, but, as E.O. Wilson said, they are 'the little things that run the world.' They do a lot of work, decomposing, making bionutrients available—without them, we would be losing a lot of biodiversity in our ecosystems. This is a great opportunity to study them and get a better understanding of the roles they play in different ecosystems."
To create an identification algorithm for bycatch from the NEON program, the researchers will first need to build a training dataset. This will require manual identification of the diverse species found in the NEON bycatch, a daunting proposition. Initially, they aim to identify specimens only to the family, subfamily, or genus level. This analysis will provide insights into the presence and relative abundance of different types of invertebrates at the NEON field sites.
To assist with identification, Blair and Marshall plan to join forces with another team using environmental DNA (eDNA) analysis. This team, headed by Dr. Cameron Siler at OU, is sequencing DNA extracted from the preservative fluid in the bycatch sample tubes to identify the arthropod species present. Combining eDNA methods with the imaging and machine learning work led by Marshall and Blair could offer the best of both worlds. eDNA provides evidence of which species are present but not of how abundant each species is in the sample tube. The imaging process could use the eDNA evidence to improve species identification while also providing a count of how many individuals of each species are present.
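A rough sketch of how the two data sources might complement each other follows. It is speculative and not a description of either team's pipeline: it assumes the image classifier returns a probability per taxon for each imaged specimen, treats the eDNA result as a list of taxa present in the tube, and assumes that list is complete, which real eDNA data may not be. The taxon names and the function name are hypothetical.

```python
from collections import Counter


def count_with_edna(specimen_probs, detected):
    """Label each imaged specimen using only eDNA-detected taxa, then tally.

    specimen_probs: one dict per specimen mapping taxon -> classifier probability.
    detected: set of taxa the eDNA analysis found in the same sample tube.
    """
    counts = Counter()
    for probs in specimen_probs:
        # Drop any candidate taxon that eDNA did not detect in this tube.
        allowed = {t: p for t, p in probs.items() if t in detected and p > 0}
        if not allowed:
            counts["unassigned"] += 1
            continue
        # Pick the most probable taxon among those that remain.
        counts[max(allowed, key=allowed.get)] += 1
    return counts


# Invented classifier outputs for three specimens from one sample tube.
specimens = [
    {"ant_taxon": 0.50, "spider_taxon": 0.40, "mite_taxon": 0.10},
    {"ant_taxon": 0.30, "spider_taxon": 0.60, "mite_taxon": 0.10},
    {"ant_taxon": 0.45, "spider_taxon": 0.15, "mite_taxon": 0.40},
]
detected = {"spider_taxon", "mite_taxon"}

print(count_with_edna(specimens, detected))
```

Here the eDNA list rules out one candidate taxon, which changes the label for two of the three specimens, while the imaging side supplies the per-taxon counts that eDNA alone cannot.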
Geographical Ecology in Action
This team's project is a great example of geographical ecology—the study of how organisms are distributed over space. Large-scale ecological data like that gathered by the NEON program allows researchers to look for patterns in which species live in different regions and habitats and how that is changing over time.
Machine learning has the potential to greatly expand the ability to make use of the data collected, especially the bycatch that is not currently identified by NEON field researchers. "Some people ask, 'will machine learning steal my job?'" says Blair. "But this is not a replacement for human researchers. The goal is to create a machine-assisted data pipeline, where the algorithm will provide an initial classification that can be verified and refined by human experts. This will vastly speed up the identification process so researchers can get to the more interesting part of the job—asking and answering questions about ecosystems."
They expect that the NEON program will continue to play a large part in geographical ecology research. Marshall says, "I know of no other organization that collects so much systematic data on such a broad spatial scale. There is no other comparable dataset that would allow you to answer these questions across such a wide range of habitats, from tropical to tundra. It's really an incredible resource. I don't know how else you would ask these kinds of questions."