Controlling the Firehose: Mapping Data's Route From Field to Browser
By Sandra Chung
In the East Wing of NEON headquarters, a door marked “Data Center” opens to a chilly, white-walled room. Eight tall cabinets with black mesh doors stand in a row across the middle of the room. “We’re in the third cabinet,” Tony Hays says, pointing to the one with the highest concentration of blue and green LED lights. By “we,” Hays means NEON’s scientific data servers, which he administers. Half of the 30 terabytes of NEON’s current database capacity is now in use. That’s more than enough capacity to house the entire text collection of the Library of Congress, and it takes up less space than a refrigerator. The other seven cabinets are all at least half-empty, as is the room itself, leaving plenty of space to grow.
NEON hasn’t yet deployed its full fleet of thousands of sensors making frequent measurements on thousands of separate plots around the country. But once we do, Hays predicts that our data storage needs will grow to petabytes (50 petabytes is one estimate of the capacity needed to store the entire written works of humanity since the beginning of recorded history). “We’re building pipes and a foundation big enough to support whatever comes,” Hays says. Hays and his colleagues on the cyberinfrastructure (CI) team are in charge of getting all of NEON’s raw data into a usable format and into the hands of users all around the world. The team of 10 is slated to expand to 26 by the second year of construction, says NEON computing director and cyberinfrastructure lead Robert Tawa. In addition to creating data storage space, extra hands and brains will be needed to engineer a suite of applications, instruments and computing systems to receive, process, manage, and deliver the enormous amount of scientific data that NEON will collect over the next three decades. Many ecological researchers are accustomed to coordinating a few trips to a few field sites to collect a few types of data over a few years. NEON operates on a much larger scale. Collecting and processing hundreds of different types of data that describe life, land, water and air across the country over 30 years requires extensive planning and preparation. With operations, equipment and personnel spread between more than 60 sites around the nation, NEON needs a cyberinfrastructure that provides for the swift, well-coordinated movement of information between many different users and locations. Moreover, NEON needs to standardize and automate many important tasks across the entire organization to ensure that the observatory functions as smoothly and efficiently as possible and produces data of a consistent and usable quality. 
“CI is essentially a data factory,” Tawa says. In addition to storing and processing data, the CI team collaborates with people from every department in NEON to develop tools that help myriad aspects of NEON operations plug into the data handling support structure. Those tools include mobile applications to guide data collection and a web portal to make the processed data sets available to users around the world. Weaving through it all is a massive asset management system that CI team members are tailoring to coordinate data collection, construction and maintenance, and to track each and every specimen, sensor and screwdriver NEON needs to analyze, buy or build. “CI is really the only component of NEON that touches every other component,” Tawa says. “Business, science, engineering – we’re in the mix for pretty much everything except human resources.”
For NEON to accomplish its mission of serving up 30 years of high-quality ecology and climate data on a continental scale, hundreds or thousands of human technicians have to make hundreds of different kinds of measurements in a consistent manner, year after year, at thousands of plots across the country. Mobile applications can help automate and standardize much of the data collection process, saving time and reducing the potential for costly human error, says software engineer DJ Spiess. Spiess worked with NEON staff and outside developers on a prototype NEON birding application for the iPad. The application contains a birding field guide and a map of birding sites and prompts technicians to enter weather information and specific details about each bird observation. The iPad sends the data directly to NEON’s servers, eliminating the extra step of entering information from handwritten data sheets into a computer. Future NEON mobile field applications could literally walk the technician through the data collection process and require checkpoints like scanning a barcode on a marker in the field to ensure the technician is in the right location, Tawa says. Barcode tags are being used to keep track of all sorts of pieces of NEON’s operations, from calibration equipment to sensors to individual beetle legs sent off for DNA analysis. Barcode information moves within a computerized asset tracking system that sprawls throughout the entire organization, coordinating and monitoring essential parts of NEON’s operations. The tracking system may someday schedule preventive maintenance and repairs on the observatory towers; issue work orders that specify the exact parts, cost and assembly directions for equipment designed by NEON’s engineering department; and monitor the calibration status of each and every one of NEON’s thousands of sensors. It is already routing purchase requests and interfacing with other computer systems used to manage NEON’s financial transactions. 
It will also track where and when data collection occurs and who or what’s producing it. “There won’t be any data coming from a sensor in the field that we don’t know about,” Tawa says.
Once data from a sensor in the field arrive at NEON’s data servers in Boulder, they enter a set of programs that calibrate, vet and convert the raw data into standardized formats that scientists around the world can use and analyze. For example, temperature data from a sensor in Alaska would be adjusted using calibration information recorded when the sensor was last calibrated at NEON’s Calibration and Validation Laboratory. The calibrated data would then undergo a programmed limit check, a type of quality control that looks to see that the data fall within a reasonable range of values. Temperature readings of more than 100 degrees Fahrenheit coming from a site in the Alaskan tundra would be flagged at this step. Extraordinarily high temperature readings might indicate that the sensor in Alaska is malfunctioning, Tawa says. Or the high temperatures may be accurate indicators of a wildfire raging near the sensor. If the sensor is malfunctioning, NEON’s asset tracking system could be used to notify a technician to go out and replace it. If it turns out that the temperature sensor is functioning normally and recording an Alaskan heat wave, the temperature data may clear the limit check and go on to be packaged for public release via the web portal. The temperature data may also need to be averaged over a few days or plugged into a computer model before being released to the public as a data product.
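The calibrate-then-check sequence described above can be sketched in a few lines of Python. This is only an illustration, not NEON's actual software: the linear gain-and-offset calibration, the site name, the lower temperature bound and every function and class name here are assumptions made for the example. Only the 100-degree upper limit comes from the scenario in the text.

```python
from dataclasses import dataclass


@dataclass
class Calibration:
    """Hypothetical linear calibration: corrected = gain * raw + offset."""
    gain: float
    offset: float


@dataclass
class Reading:
    site: str
    value_f: float         # calibrated temperature, degrees Fahrenheit
    flagged: bool = False  # True when the value falls outside the limits


def calibrate(raw_f: float, cal: Calibration) -> float:
    """Adjust a raw reading using the sensor's last recorded calibration."""
    return cal.gain * raw_f + cal.offset


def limit_check(site: str, value_f: float, low_f: float, high_f: float) -> Reading:
    """Flag (but do not discard) values outside a plausible range."""
    in_range = low_f <= value_f <= high_f
    return Reading(site, value_f, flagged=not in_range)


def process(site, raw_values, cal, low_f, high_f):
    """Calibrate each raw value, then run the limit check on the result."""
    return [limit_check(site, calibrate(v, cal), low_f, high_f) for v in raw_values]


# Assumed values for illustration only.
cal = Calibration(gain=1.0, offset=0.5)
readings = process("alaska-tundra", [40.0, 101.0], cal, low_f=-80.0, high_f=100.0)
needs_review = [r for r in readings if r.flagged]
```

Flagged readings are kept rather than dropped, because a value outside the limits may turn out to be a real event such as a wildfire or a heat wave. Deciding between a faulty sensor and a genuine extreme is left to a follow-up step.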
Members of the CI team are working with NEON and the rest of the scientific community to determine which data formats will be the most useful and how the data should be transformed into those formats. Each piece of data follows a carefully defined route from field to finish. These routes are crucial legs of a quest to guarantee the quality of NEON data. NEON scientists and engineers design requirements for data collection methods and equipment that help meet those standards. The scientists and engineers collaborate with the CI team to flesh out the exact data handling, equipment and construction procedures to meet those requirements. Codifying those procedures is a necessary prerequisite to creating cyberinfrastructure to support them. “You have to know exactly what you need to do first,” says NEON data engineer Scott Wiant. “That’s the hardest part, and the part that’s easiest to get wrong.” One of the biggest challenges for NEON is to find out and describe in detail the best possible ways to build and run the observatory and collect data. The main challenge for the CI team is to then create computational and data handling tools to support construction, operations and data collection as efficiently as possible. The reward is a streamlined cyberinfrastructure that NEON employees and equipment can plug into from day one to make high-quality NEON data available to the world as swiftly and efficiently as possible.