The NEON project is producing a vast treasure trove of open access remote sensing data. Can computer algorithms help ecologists make sense of it all?
A team of ecologists and data scientists at the University of Florida thought so. To accelerate the process, they initiated a data science challenge. The NIST-DSE Plant Identification with NEON Remote Sensing Data Challenge invited teams to compete to develop algorithms that could correctly classify and delineate trees using NEON remote sensing data. The results of the challenge were recently released in PeerJ: A Data Science Challenge For Converting Airborne Remote Sensing Data Into Ecological Information.
Data Competitions and the Rise of "Big Data" Ecology
The challenge was organized by researchers from the Data Science Research lab (led by Daisy Wang), the Weecology lab (led by Ethan White), and Stephanie Bohlman’s lab at the University of Florida, in active collaboration with scientists from the NEON project. It was sponsored by the National Institute of Standards and Technology (NIST) Data Science Evaluation (DSE) Series and the Gordon and Betty Moore Foundation’s Data-Driven Discovery Initiative.
Data challenges have become a common strategy for advancing data science in many fields, ranging from image classification to finance. The competitions invite individuals or teams of data scientists to work independently to solve a specific problem. The solutions are then evaluated against a common rubric to determine which approach works best for the challenge presented. Often, each solution has its own set of strengths and weaknesses, and the "ideal" solution may combine elements of two or more different submissions. Looking at the insights gained through each solution can significantly accelerate the rate of progress in solving complex "big data" problems.
Until now, this approach has not been used widely to solve ecological problems—in fact, the Remote Sensing Data Science Challenge is believed to be the first of its kind. Sergio Marconi, a PhD student in interdisciplinary ecology at the University of Florida and one of the organizers of the challenge, says, "We are facing a moment in which the use of big data in ecology has become a reality. We need to bring ecology and data science together if we're going to get value from all of the data we now have available."
Putting Remote Sensing Data to Work
The organizers decided to use airborne remote sensing data from the NEON project because the data are freely available, consistent from site to site and year to year, and represent a large and growing data set that provides an ideal testing ground for data science solutions. Sarah Graves, a doctoral candidate in the School of Forest Resources and Conservation at the University of Florida, explains, "Remote sensing data is an important tool for understanding forest composition and structure. Using computer algorithms to analyze the data will speed up this process by orders of magnitude. But before that is a viable solution, we need to test those algorithms to see how accurate they are and how their results compare to data gathered on the ground."
The Remote Sensing Data Challenge asked participants to tackle three problems:
- identify the location and size of individual trees (crown segmentation);
- match remote sensing data on individual trees to data verified through ground observations; and
- classify individual trees by species using the remote sensing data.
Participants were each given identical remote sensing data collected by a NEON Airborne Observation Platform (AOP) over the Ordway-Swisher Biological Station in north-central Florida. The NEON project uses three AOPs. Each AOP includes a hyperspectral imaging spectrometer, a full-waveform and discrete-return LiDAR, and a high-resolution Red-Green-Blue (RGB) camera. Once installed in a Twin Otter aircraft, the AOP is flown over NEON field sites annually to collect a variety of data products. The data challenge used four of these data products: LiDAR point cloud data, the LiDAR canopy height model (CHM), hyperspectral surface reflectance, and high-resolution visible color (RGB) photographs. Teams used these data products to identify individual trees and classify them by species. The results were compared to observational data gathered on the ground to determine the accuracy of each algorithm.
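To give a sense of what working with a canopy height model involves, here is a minimal sketch of one common first step in finding individual trees: treating local height maxima in the CHM as candidate tree tops. This is not the method used by any of the competing teams. The function name `find_tree_tops`, the 2-meter height cutoff, and the tiny grid of heights are all illustrative assumptions, and a real CHM would be read from a raster file rather than typed in by hand.

```python
from typing import List, Tuple


def find_tree_tops(chm: List[List[float]], min_height: float = 2.0) -> List[Tuple[int, int]]:
    """Return (row, col) cells that are strictly taller than all 8 neighbors.

    `chm` is a grid of canopy heights in meters. `min_height` is an assumed
    cutoff that ignores ground and low shrubs; it is not a NEON-defined value.
    """
    rows = len(chm)
    cols = len(chm[0]) if rows else 0
    tops = []
    for r in range(rows):
        for c in range(cols):
            height = chm[r][c]
            if height < min_height:
                continue
            # Collect the heights of the surrounding cells, clipped at the grid edge.
            neighbors = [
                chm[rr][cc]
                for rr in range(max(0, r - 1), min(rows, r + 2))
                for cc in range(max(0, c - 1), min(cols, c + 2))
                if (rr, cc) != (r, c)
            ]
            if all(height > n for n in neighbors):
                tops.append((r, c))
    return tops


# Made-up 5 x 5 height grid (meters) with two peaks.
example_chm = [
    [0, 1, 1, 0, 0],
    [1, 9, 3, 1, 0],
    [1, 3, 2, 4, 1],
    [0, 1, 4, 12, 3],
    [0, 0, 1, 3, 2],
]
tree_tops = find_tree_tops(example_chm)
print(tree_tops)
```

Finding tree tops is only a starting point. Delineating the full extent of each crown, which is what the segmentation task asked for, requires growing a region around each peak or using another approach entirely.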
Six teams, made up of 16 individual participants, entered the data challenge. Several of the algorithms performed well for the species classification part of the challenge, with the best correctly classifying 92% of the trees. The crown segmentation part of the challenge proved more difficult, with the highest-performing algorithm generating a 34% overlap with data verified through on-the-ground observations.
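The two scores above measure different things, and a small sketch may make the distinction concrete. The article does not say exactly how overlap was calculated, so the code below assumes a Jaccard-style measure (area of intersection divided by area of union) and, for simplicity, represents each crown as a rectangle. The function names, coordinates, and species labels are invented for illustration and are not taken from the challenge data.

```python
from typing import Sequence, Tuple

# A crown simplified to a rectangle: (xmin, ymin, xmax, ymax) in meters.
Box = Tuple[float, float, float, float]


def box_area(box: Box) -> float:
    """Area of a rectangle; zero if the rectangle is empty or inverted."""
    xmin, ymin, xmax, ymax = box
    return max(0.0, xmax - xmin) * max(0.0, ymax - ymin)


def jaccard_overlap(predicted: Box, reference: Box) -> float:
    """Intersection area divided by union area (assumed overlap measure)."""
    intersection = box_area((
        max(predicted[0], reference[0]),
        max(predicted[1], reference[1]),
        min(predicted[2], reference[2]),
        min(predicted[3], reference[3]),
    ))
    union = box_area(predicted) + box_area(reference) - intersection
    return intersection / union if union > 0 else 0.0


def classification_accuracy(predicted: Sequence[str], observed: Sequence[str]) -> float:
    """Fraction of trees whose predicted species matches the field observation."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not observed:
        return 0.0
    correct = sum(p == o for p, o in zip(predicted, observed))
    return correct / len(observed)


# Hypothetical crown: the algorithm's outline is shifted from the field outline.
overlap = jaccard_overlap((0.0, 0.0, 4.0, 4.0), (2.0, 2.0, 6.0, 6.0))

# Hypothetical labels: three of four trees are classified correctly.
accuracy = classification_accuracy(
    ["pine", "oak", "pine", "oak"],
    ["pine", "oak", "oak", "oak"],
)
print(f"overlap = {overlap:.2f}, accuracy = {accuracy:.2f}")
```

Under this assumed measure, a predicted crown of the right size that is shifted by half its width in each direction scores only about 0.14, which helps explain why segmentation scores can look low next to classification accuracy.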
Bringing Data Science and Ecology Together
While none of the algorithms performed perfectly, they are a big step forward from out-of-the-box software used for other types of image classification. Organizers noted that the best performing teams combined both data science expertise and an understanding of the subject area.
Ultimately, the organizers see the data challenge process as collaborative rather than competitive. Each of the teams contributes something of value and they all learn from each other's approaches. Sergio says, "We’ve made huge gains in our ability to tackle these kinds of problems in impressively little time, even with a small number of groups. We're able to see how different algorithms work and which work better in different situations. There is never just one right solution. It's about finding which solutions work best for the task at hand."
This year's data challenge may be the first of many. The team hopes to make this an annual event, gradually increasing the complexity and difficulty of the challenge questions. Moving forward, they plan to continue to use NEON data. Sarah says, "The NEON data is a great dataset to use for these types of competitions because it's easily accessible to everyone for free and the documentation for how data are collected is very clear and transparent. And we know these data will continue to be collected in a standardized and open way for 30 years. That means we can use the same type of datasets over time, but ask tougher questions and ask participants to do more with them."
Both Sarah and Sergio say they would also like to see more collaboration between data scientists and the ecology community. They both cite participation in past NEON Data Institutes as one of the factors that led to their interest in the idea of a data challenge. Sergio says, "The Institute helped me understand how to approach ecological data from a data science perspective and what data scientists would need to make the data usable."
Building tighter connections between ecologists and data scientists will help the research community maximize the value of large data sets from the NEON project and other observatory networks. Ultimately, this approach will accelerate scientific progress. Sarah says, "We've learned so much by working with computer science students. Working across disciplines is challenging, but it is when we are challenged that we are able to start asking better questions. That's where the real advances are made."
Learn more about NEON's airborne data.