Combining Data with Models to Bring Continental Scale to Ecology
By Sandra Chung
It’s July 2012, and NEON Data Products Scientist Andy Fox is teaching summer school 9,500 feet above sea level. He is explaining data assimilation to a room full of scientists taking an advanced modeling course at the University of Colorado Mountain Research Station near Nederland, Colorado. Fox introduces the scientific problem at hand: NEON measures ecologically important processes with ground-based and airborne sampling over several square miles at each of its field sites. NEON aims to deliver data sets that describe ecosystem processes over the entire United States – an area tens of thousands of times larger than what the observatory directly measures. “What we have here at NEON is a massive case of extrapolation,” Fox says. To extrapolate is human. We can’t observe the entire world directly, so we fill in our blind spots with what, to the best of our knowledge, is likely to be there. Our most sophisticated extrapolation tools are computational models, simulations governed by mathematical and statistical relationships that describe our best estimates of how the world works. We use models to simulate and forecast presidential elections, hurricanes, water supplies, even the movements of video game objects in 3-D space.
Fox is one of many scientists working with a model that can make ecological forecasts. He and his collaborators are developing a method to combine NEON data and the Community Land Model (CLM) to both improve the model’s forecast accuracy and scale up observatory measurements across space. The process of combining models and data is called data assimilation. It originated as a way to improve weather forecasting and has only very recently begun to be applied to ecological forecasting. Ecological processes, particularly continental-scale ones, aren’t governed by tidy equations. Even without knowing the exact mathematical equations that govern every ecologically important process, researchers can use data assimilation to bring the ecological model states more closely in line with the real world. Assimilating data into a model as complex as CLM is pushing the limits of what is possible with current computer technology. But it is absolutely key to putting the continental scale in NEON’s continental-scale data sets, and to extracting the greatest possible scientific insight and predictive power from the growing flood of ecological and environmental Big Data. In making it possible to understand and forecast ecology across an entire continent, NEON will empower policy makers, environmental managers and business owners with the information they need to make more confident decisions about natural resources for decades to come. Data assimilation will provide both more data and better analytical tools to help make that happen.
Challenging supercomputers and software with ecological complexity
Tim Hoar is an associate scientist at the National Center for Atmospheric Research. He helps researchers like Fox make use of NCAR supercomputing resources to test various data assimilation methods with different computational models. Hoar has worked with models that describe the dynamics of oceans, atmosphere, storms, climate, even Martian weather. He can run “a dizzying number of models in a very limited capacity,” he says. “CLM is absolutely one of the most complicated models I have ever run across,” Hoar says.
Every model starts with a set of initial conditions gridded across space. For example, a weather model’s initial conditions might include temperature, humidity and wind speed measurements taken at identical weather stations in different locations at the same time. The model fills in any blind spots over space by extrapolating from those measurements according to mathematical or statistical rules that describe how conditions in one grid cell correspond to conditions in another. The model uses another set of rules or algorithms to advance the initial conditions to the next point in time. The more accurate the initial information and rules, the more realistically the model extrapolates over space and time. Numerical weather prediction models run on rules derived from Newtonian physics, and just a handful of variables are enough to describe the movement of heat and moisture in the atmosphere around the world. Weather models used in meteorological forecasting require so many calculations over so many grid cells over so many short periods of time that only a supercomputer can handle the computational load. They’re so sensitive that a tiny bit of inaccuracy or imprecision in estimates and measurement quickly compounds into larger and larger errors and uncertainty, which is why weather forecasts become less reliable as they reach farther into the future. Ecological forecasting, in contrast, is conceptually more taxing than forecasting weather, and the equations governing ecological processes are not nearly as well defined.
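To see how a tiny error in the initial conditions can compound as a model steps forward in time, consider a deliberately simple sketch. It is not a weather model: it uses the logistic map, a one-line textbook equation known for this kind of sensitivity, as a stand-in. The function names, starting values and number of steps are all assumptions chosen for illustration.

```python
def logistic_step(x, r=4.0):
    """Advance the toy state one time step using the logistic map."""
    return r * x * (1.0 - x)


def run(x0, steps):
    """Return the full trajectory from x0, including the initial state."""
    states = [x0]
    for _ in range(steps):
        states.append(logistic_step(states[-1]))
    return states


# Two runs whose starting points differ by one part in a million.
truth = run(0.400000, 30)
forecast = run(0.400001, 30)

# Absolute difference between the two runs at every time step.
errors = [abs(a - b) for a, b in zip(truth, forecast)]

for step in (0, 5, 10, 20, 30):
    print(f"step {step:2d}: error = {errors[step]:.6f}")
```

Both runs obey exactly the same rule, yet the gap between them grows from step to step until the forecast says little about the truth. That is the behavior that limits how far ahead a forecast can be trusted.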
The dynamics of processes like the carbon cycle that involve both living and nonliving components of the environment include many steps and interacting players, and require a huge number and variety of measurements over a wide area and a long period of time to produce useful forecasts. For instance, the movement of carbon between plants, atmosphere, soil and water depends on interacting variables including air temperature, sunlight, precipitation, latitude, elevation, soil moisture and temperature, the types and density of microbes in the soil, and the vegetation on land.
CLM, an ecological land surface model developed at NCAR, subdivides each grid cell by land cover type: glacier, lake, wetland, urban or vegetated. The model further breaks down vegetation into vegetation types: trees or grass, broadleaf or needle leaf, evergreen or deciduous. It uses these classifications to tune its estimates of the rates of ecologically important processes like the movement of water, carbon and nitrogen in response to climate through different parts of the ecosystem. “We take our understanding of the physics and the biology which occur at a point and we use that information to paint by numbers across the landscape on the basis of what the vegetation’s like there,” Fox says. The “paint by numbers” procedure makes it possible to describe what’s probably happening across a vast swath of space with measurements from just a few specific locations. But the amount of calculation needed to do so is enormous, Fox says. The CLM needs a table of information hundreds of thousands of rows tall and hundreds of thousands of columns wide to extrapolate across space at a given point in time. Moving the model forward through time requires still many more calculations for each of the grid cells. On top of that, the data assimilation method Fox uses requires running 64 instances of the model at once. Open-source software that Hoar helped develop, the Data Assimilation Research Testbed (DART), helps split this massive computational task between the many processors of a supercomputer. It’s necessary to run so many instances of CLM because each set of initial conditions and uncertainties in the weather model used to drive the CLM usually gives slightly different results. Repeated runs of the model produce a range of solutions rather than a single one. Simple statistics transform that spread of solutions into an average value with a specific amount of uncertainty attached to it.
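The step from a spread of model runs to "an average value with uncertainty" can be sketched in a few lines. The example below is an illustration only: `toy_model` is a hypothetical stand-in with made-up numbers, not CLM, and the perturbation sizes are assumptions. Only the ensemble size of 64 comes from the description above.

```python
import random
import statistics

N_MEMBERS = 64  # ensemble size mentioned in the article


def toy_model(initial_carbon, forcing, steps=10):
    """Hypothetical stand-in for a land model: a carbon stock that
    gains from weather-driven uptake and loses a fixed fraction each step."""
    carbon = initial_carbon
    for _ in range(steps):
        carbon += 0.1 * forcing - 0.02 * carbon
    return carbon


rng = random.Random(42)  # fixed seed so the sketch is repeatable
ensemble = []
for _ in range(N_MEMBERS):
    initial = rng.gauss(100.0, 5.0)  # slightly different initial condition
    forcing = rng.gauss(20.0, 2.0)   # slightly different weather forcing
    ensemble.append(toy_model(initial, forcing))

# Summarize the range of solutions as a mean and a one-sigma spread.
mean = statistics.mean(ensemble)
spread = statistics.stdev(ensemble)
print(f"ensemble mean = {mean:.2f}, spread (1 sigma) = {spread:.2f}")
```

Each member is an equally plausible run. The mean serves as the best single estimate, and the standard deviation expresses how much the members disagree.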
Uncertainty may sound like a handicap, but it is a fact of life that no measurement is ever 100% certain. In fact, a little uncertainty is essential to making Fox’s data assimilation method work.
Uncertainty is essential
Every measurement has some amount of uncertainty to it. Your GPS, your car’s speedometer, and the police officer’s radar gun might clock you at slightly different speeds over the exact same period of time. They can’t all be correct. Reflections off nearby objects often throw a GPS signal off course. Rainy or foggy weather wreaks havoc on the radar gun’s accuracy. And your speedometer’s accuracy changes as you wear down or replace your tires. None of these measuring devices performs perfectly all the time, but by testing each device under different conditions it’s possible to estimate how much the device’s measurements might be affected by the circumstances and by the limitations of the devices themselves. The radar gun might be within five miles per hour 95 percent of the time in typical police use; that figure is a measure of uncertainty. NEON observations have uncertainty as well. Standardization and quality control processes keep that uncertainty within a target range and as consistent as possible. The data assimilation method Fox uses compares the model states to real-world observations at a given point in space and time and adjusts the model states accordingly.
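One common way to nudge a model value toward an observation while respecting the uncertainty of both is a Kalman-style update, sketched here for a single number. This is a textbook illustration and not a description of the algorithm DART actually runs on CLM. The function name and the example values are assumptions.

```python
def assimilate(model_value, model_var, obs_value, obs_var):
    """Blend a model estimate with an observation, weighting each by how
    certain it is (variance = squared uncertainty; smaller means more trusted)."""
    # The gain is the fraction of the model-observation gap to close.
    gain = model_var / (model_var + obs_var)
    updated_value = model_value + gain * (obs_value - model_value)
    # The blended estimate is more certain than the model was on its own.
    updated_var = (1.0 - gain) * model_var
    return updated_value, updated_var


# Model says 10 (variance 4); a more precise observation says 12 (variance 1).
value, var = assimilate(10.0, 4.0, 12.0, 1.0)
print(f"updated value = {value:.2f}, updated variance = {var:.2f}")

# A model that claims zero uncertainty ignores the observation entirely.
stubborn_value, _ = assimilate(10.0, 0.0, 12.0, 1.0)
print(f"overconfident model stays at {stubborn_value:.2f}")
```

In the first case the updated value lands at about 11.6, closer to the observation because the observation is the more certain of the two. The second case hints at why some uncertainty is essential: a model state with no uncertainty at all leaves no room for the data to correct it.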
DART adjusts the model states to be more consistent with the observation while taking into account the uncertainty of the model states and the uncertainty of the observation as it relates to the model state. The model can then move forward from a more reliable set of initial conditions to advance to the next point in time. “I’ve been doing data assimilation for 10 years,” Fox says. “It used to be the models that we did this with had 20 lines of computer code. This model is hundreds of thousands of lines of computer code. It’s a massive software engineering challenge.” Just getting the CLM to “talk” to DART has been a major task for Fox and Hoar. Fox has logged many processor hours on the NCAR supercomputer Bluefire, and he has booked thousands more processor hours on Bluefire’s successor, the powerful new Yellowstone supercomputer at the new NCAR-Wyoming Supercomputing Center. Even so, he’s still looking only at carbon cycling and running the model at a very limited scale. Tackling a problem this big is a team effort. The growing data assimilation community has been churning away at the CLM problem and others like it, sharing solutions and ideas and improving resources like open-source software.
Software that enables the many processors of a supercomputer to churn away simultaneously at separate chunks of the problem makes the work go much faster, and empowering others to contribute to the work serves the same purpose at a human scale. “I have people every week asking me how I do this stuff,” Fox says. “We do our best to help.” Fox estimates that he and his collaborators are 10 percent done tackling the specific problem of data assimilation and CLM. He has been working with carbon flux data from the existing Ameriflux network, but in a few short years, NEON will be producing huge quantities of data of many types that need to be assimilated into the model as well. Other researchers are working on refining the model itself based on improving ecological data and increasing knowledge of ecological processes. As computing continues its exponential growth pattern, we can expect ever more powerful supercomputers and software to handle the complexity and scope of CLM with greater speed and ease in the near future. Once Fox and his colleagues do get it all to work as planned, they will have created a means to unprecedented ecological data and analytical power. The eventual payoff to society could be huge. Data scientists are already using Big Data and computational modeling to predict everything from product sales to election results and to help many of the most successful businesses in the world meet their goals. NEON data and analytical tools will provide similar high-quality information for natural resource management, and will help society navigate the trials of environmental change with foresight of unprecedented depth and clarity.