From thousands of automated sensors, to hundreds of field staff working through collection protocols, to airborne instruments operating during the flight season, NEON produces millions of data points every day. To ensure that these data can be ingested into NEON’s systems, processed, published, and eventually used, careful attention to how data are organized, named, and documented is critical. In this section, we introduce the basics of the data formats and conventions that NEON uses.
General Data Formats and Conventions
This section describes data formats and conventions that NEON uses across all products. For more information specific to data collection systems (observational, instrumented, and airborne), jump down to each Data Collection System section.
When a query for one data product, one or more sites, and a date range is submitted to the data portal or API, a downloadable data package is generated from a store of pre-published files and then zipped. Every package includes data files and documentation files. Each NEON data collection system structures data packages differently to maximize the utility of the data, but generally there are separate pre-published file sets for each site and month, which are bundled into the download package. While data are provided in this granular way, our neonUtilities code package provides a straightforward method to join files across multiple sites and months.
Data files are named using a series of component abbreviations separated by periods ( . ) or underscores ( _ ). Naming conventions for data files differ between NEON data collection systems to meet the needs of their dominant user groups. A file will have the same name whether it is accessed via the data portal or the API. For more information, read the NEON Data Product Numbering Convention and explore the tables below.
| Abbreviation | Description |
|---|---|
| NEON | An identifier that specifies that the data come from NEON. |
| DOM | A three-character alphanumeric code, referring to the Domain of data acquisition (D01-D20). |
| SITE | A four-character code, referring to the site of data acquisition. |
| DPL | A three-character alphanumeric code, referring to the Data Product processing Level. |
| PRNUM | A five-character numeric code, referring to the Data Product Number. |
| REV | A three-digit designation, referring to the revision number of the data product. The REV value is incremented by 1 each time a major change is made in instrumentation, data collection protocol, or data processing such that data from the preceding revision are not directly comparable to the new revision. |
| HOR | A three-character alphanumeric code for the measurement locations within one horizontal plane. For example, if one surface measurement were made at each of five soil array plots, the number in the HOR field would range from 001-005. |
| VER | A three-character alphanumeric code for the measurement locations within one vertical plane. For example, if one temperature measurement is made at each vertical level of a tower with 8 levels, the number in the VER field would range from 010-080. |
| TMI | A three-character alphanumeric code for the Temporal Index: the temporal representation, averaging period, or coverage of the data product. 000 = native resolution; 001 = native resolution or 1 minute; 002 = 2 minute; 005 = 5 minute; 015 = 15 minute; 030 = 30 minute; 060 = 60 minute (1 hour); 100 = instantaneous measurements; 101-103 = native resolution of replicate sensors 1, 2, and 3, respectively; 999 = measurements at varied intervals. |
| DESC | An abbreviated description of the data file or table. |
| YYYY-MM | The year and month of the data in the file. |
| PKGTYPE | The type of data package downloaded. Options are 'basic', representing the basic download package, or 'expanded', representing the expanded download package (see more information below). |
| GENTIME | The date-time stamp when the file was generated, in UTC. The format of the date-time stamp is YYYYMMDDTHHmmSSZ. |
| RELEASE | The data release tag (e.g., "RELEASE-2021"), or "PROVISIONAL" if not included in a data release. |
| YY | Year, last two digits only. |
| T | Indicator that the time portion of the stamp begins. |
| Z | Coordinated Universal Time (UTC). |
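As an illustration of these naming components, a file name can be split into its labeled parts with a short Python sketch using only the standard library. The pattern below is simplified relative to the full NEON Data Product Numbering Convention, and the example file name (including the Domain and product numbers) is constructed for illustration only:

```python
import re

# Simplified pattern for an IS data file name, following the component
# table above. This is an illustrative sketch, not NEON's official parser.
FILENAME_RE = re.compile(
    r"NEON\.(?P<dom>D\d{2})\.(?P<site>[A-Z]{4})\.(?P<dpl>DP\d)\."
    r"(?P<prnum>\d{5})\.(?P<rev>\d{3})\.(?P<hor>\d{3})\.(?P<ver>\d{3})\."
    r"(?P<tmi>\d{3})\.(?P<desc>[A-Za-z0-9_]+)\.(?P<yyyymm>\d{4}-\d{2})\."
    r"(?P<pkgtype>basic|expanded)\.(?P<gentime>\d{8}T\d{6}Z)\.csv"
)

def parse_neon_filename(name):
    """Split an IS data file name into its labeled components."""
    m = FILENAME_RE.match(name)
    if m is None:
        raise ValueError(f"not a recognized NEON file name: {name}")
    return m.groupdict()

# Hypothetical example name assembled from the components in the table above.
example = "NEON.D18.TOOL.DP1.00001.001.000.010.002.2DWSD_2min.2019-03.basic.20190422T205021Z.csv"
```

Running `parse_neon_filename(example)` returns a dictionary keyed by the component abbreviations, e.g. `site` → `"TOOL"` and `tmi` → `"002"`.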
Tabular Data - Wide vs. Long
Tabular data can generally be presented in a range of dimensions, from "long" or "narrow" format (many rows, few columns) to "wide" (many columns, fewer rows). Conventions vary both across and within disciplines. It is common to present data in long format when there are repeated measurements of one or only a few parameters over time, and in wide format when numerous parameters are being tracked. Formatting is often influenced by the desire to make the measured parameters more readable to the human eye (typically wide format, though sometimes long format for short tables) or to make data easier to subset by machine (typically long format). For a brief overview of these two formatting options, see this Wikipedia page. NEON observational and instrumented data are published in tabular format; see below for long vs. wide details for each collection system.
The column headers that are used to describe variables, also often called field names or terms, are intentionally controlled to be easily understood and used in precise ways. Different data collection systems use somewhat different methods to generate and maintain variable names.
Dates and Times
Date and time fields, or time stamps, follow the ISO 8601 standard. In general, this means that a date (for example, January 23, 2020) is typically written out in the format YYYY-MM-DD (2020-01-23). Times are typically added to a date in the format HH:mm:ss'Z', where the Z indicates UTC, or Coordinated Universal Time. UTC is appropriate for NEON data, as NEON field sites are spread across multiple time zones. The format for any given date-time field may be found within the metadata file supplied in a data package. This definition may also contain any rounding (e.g., 'floor' or 'ceiling') that was used on a time stamp.
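Both the ISO 8601 observation time stamps and the GENTIME-style generation stamps (YYYYMMDDTHHmmSSZ) can be parsed with the Python standard library. This is an illustrative sketch, not NEON-provided code:

```python
from datetime import datetime, timezone

def parse_gentime(stamp):
    """Parse a file generation stamp, e.g. '20190422T205021Z', as aware UTC."""
    return datetime.strptime(stamp, "%Y%m%dT%H%M%SZ").replace(tzinfo=timezone.utc)

def parse_observation_time(stamp):
    """Parse an ISO 8601 data time stamp, e.g. '2020-01-23T00:00:00Z', as aware UTC."""
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
```

Explicit `strptime` formats with a literal `Z` work on any recent Python version, whereas `datetime.fromisoformat` only accepts the `Z` suffix from Python 3.11 onward.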
Data flags can be used to help guide decisions on whether data are fit for specific uses. A tailored suite of quality tests is applied to each data product, and the results are provided along with data values in the form of quality flags and metrics.
The Document Library is a rich resource of information about our data products, including overarching science designs, site characterization reports, spatial data, field protocols, data processing documentation, and user guides. Relevant documents are linked to each data product's detail page, which may be accessed via the Explore Data Products page. Data packages also contain the documents that are relevant to the data product.
The following documents are available for most data products, regardless of data collection system:
Science Designs: The science design documents provide the background and strategy used for data collection. They frequently bridge related data products.
Algorithm Theoretical Basis Document (ATBD): A full explanation of the algorithms used to process data. Each ATBD details the scientific theory behind the measurement, relevant processing algorithms, as well as the steps taken to determine uncertainty and to perform quality control/quality assurance. Some ATBDs are specific to a data product, while others describe algorithms applied to many data products.
Each data collection system also has documents specific to it. To learn more, skip to the Specific Information by Data Collection System section below.
Each data package contains one or more types of metadata files that may be either human or machine readable. The consistent metadata file provided with all data products is the README file, and EML files are provided for most data products.
README file: A short summary of the data downloaded. It includes a brief description of the data product, file naming conventions, and the issue log. Much of the information in the README can also be found on the Data Product Detail pages on the Data Portal.
EML: Contains machine-readable metadata about the data using the Ecological Metadata Language (EML). EML is a widely used, community supported XML schema that supports rich documentation of data related to ecological research, particularly including environmental, ecological, and earth science data. EML files, which are served with the extension .xml, include site location, data policy, and variable definitions and units. EML files also contain a few pieces of information included nowhere else, such as precision values and time stamp formats.
Instrument Systems (IS) Data
A single field site often has multiple sensors of the same type, each at a different location. One example is soil moisture sensors along an array. A separate data table is provided for each combination of position and calculated averaging interval. An example of a table name is 2DWSD_2min, which translates to 2D Wind Speed and Direction at a 2 minute averaging interval.
The level of granularity for most IS data products is one data file per data product, site, month, vertical and horizontal position of the sensor collecting the data, and calculated averaging interval (most instrumented data are provided at one- and thirty-minute averages). Data files are named using the following pattern:
One example of this file name would be:
This indicates 2D Wind Speed and Direction at site TOOL (Toolik), collected at tower level one in March of 2019, and that data were averaged over every two-minute interval. The data file itself was generated on April 22, 2019 at 20:50:21 UTC.
NEON IS data are provided as comma separated values (CSV) files. The tables are long with respect to time, but wide for all other variables; that is, there is a row in the table for every time stamp that contains all values measured at that time. The neonUtilities R package, when applied to IS data, keeps the long format with respect to time, and also combines data collected at different locations in long format, so that there is a row in the table for each time stamp by location combination.
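The row-wise joining that neonUtilities performs across site-month files can be sketched with the Python standard library. This is a simplified, hypothetical stand-in (the real package also reconciles locations and metadata), and the column names used in the usage example are illustrative:

```python
import csv
import io

def stack_tables(csv_texts):
    """Concatenate CSV tables that share a header, keeping a single header row.

    A simplified stand-in for the stacking that neonUtilities performs
    across site-month files of the same data table.
    """
    header, rows = None, []
    for text in csv_texts:
        reader = csv.reader(io.StringIO(text))
        this_header = next(reader)
        if header is None:
            header = this_header
        elif this_header != header:
            raise ValueError("tables have different columns")
        rows.extend(reader)  # append data rows, skipping repeated headers
    return header, rows
```

For example, stacking two monthly files of an illustrative wind table yields one header and the combined rows, still long with respect to time.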
Every term is unique; it has a unique name, definition, unit, and data type. For example, the term "decimalLatitude" has the unique definition, "The geographic latitude (in decimal degrees, WGS84) of the geographic center of the reference area", the unit "decimalDegree", and data type "string". Terms may not be used more than once within the same data product table. However, terms may be reused between different data products if the meaning is exactly the same. Wherever that term is used, the definition, unit, and data type are exactly the same. Variable names and their definitions can be found in the variables file in the data downloads.
The NEON variable names for the IS datasets that are submitted to AmeriFlux are mapped to the AmeriFlux data variable names. IS data not submitted to AmeriFlux are not formally mapped to any community-standard vocabularies or ontologies.
Quantitative, or numeric, variables: These are generally restricted to integer or real values. Where needed, real values may be rounded to the significant figures appropriate for the data. Precision values may be found within the EML (.xml) file supplied in a data package.
Missing values: A blank cell indicates that data were not collected, data were filtered by automated quality control procedures, or, in rare cases, data were redacted. Redacted data are also indicated by a quality flag. Some software, such as Excel or R, may interpret blank values in different ways, most frequently as "NA".
Data quality flags may indicate that a data value was outside of a typical range of values or that some quality check didn't pass. Each specific flag is well defined by its term description, found in the EML and variables files.
Data quality metrics are summaries of data quality flags for data values that were derived from an aggregate of values, such as for flags on 30-minute values that were aggregated from 1-minute values.
Final quality flags aggregate the results of several quality tests into a final indication of whether each data value is trustworthy or questionable.
Science review flags indicate whether a NEON scientist has, in a specific review of data, recommended that data be marked as questionable or removed.
For more information about quality flags, particularly those used by IS, please refer to the Data Quality Program page and the ATBD: Quality Flags and Quality Metrics for TIS Data Products.
In addition to Science Designs and ATBDs, IS data products are documented by the following document types:
Sensor Command, Control, and Configuration (C3) Document: Specifies the command, control, and configuration details for operating the relevant sensor and its assembly. It includes a detailed discussion of all necessary requirements for operational control parameters, conditions/constraints, set points, and any necessary error handling.
NEON Preventive Maintenance Procedure: Specifies a list and schedule of checks and actions that NEON personnel perform on the relevant sensor and its assembly to ensure its proper operation. Detailed instructions are provided for more complex tasks.
In addition to the README and EML file, IS data products are supplied with:
Sensor position file: Contains the positions of the sensors relative to a reference location, as well as the reference location coordinates.
Variables file: Contains variable definitions and units for each column in each table of the data product.
Surface Atmosphere Exchange (SAE) Data
Surface-Atmosphere Exchange data are delivered as the "Bundled Data Products - Eddy Covariance" data product in the Hierarchical Data Format (HDF5) as a 'bundle' of many data products that are delivered together. Similar to other instrumented data products, each zip file within the downloaded zip contains data for a single site and month.
The basic package contains monthly data files, while the expanded package contains daily data files with all the information that is provided in the basic files along with half-hourly footprint weight matrices and additional data quality information. The naming convention for the downloadable package, and the monthly or daily files contained within the package, are as follows:
- Folder: NEON.DOM.SITE.DP4.00200.001.YYYY-MM.PKGTYPE.GENTIME.RELEASE
- Monthly data files: NEON.DOM.SITE.DP4.00200.001.nsae.YYYY-MM.PKGTYPE.GENTIME.h5.gz
- Daily data files: NEON.DOM.SITE.DP4.00200.001.nsae.YYYY-MM-DD.PKGTYPE.GENTIME.h5.gz
HDF5 is similar to netCDF in that it is a compressed, self-describing file format that allows for hierarchical structuring of data. Each HDF5 file contains numerous data tables. We recommend using the HDFView tool, provided as a free download from the HDF Group, to explore the files; we also provide the stackEddy() function in our neonUtilities code package to join data tables across sites and months. Alternatively, the rhdf5 and h5py packages provide functions to interface with the files in the R and Python languages, respectively.
Variable definitions are provided in the NSAE_HDF5_object_description.csv file that is downloaded with the data product, as well as in the objDesc table within the HDF5 file. Here, term names are more generic than in IS files, with terms such as max and mean repeated for each data product, and nsae, turb, and stor repeated for each scalar. Fully descriptive terms are derived from the location of the dataset in the HDF5 file structure (e.g., the nsae dataset in the fluxCo2 group contains net surface-atmosphere exchange CO2 flux data). The variables are named to help streamline the complex code involved in processing the data from raw (level 0) to highly derived (level 4) products.
Missing Values: "NaN" is used to represent missing data, and "NA" to represent missing metadata. Missing timestamps in monthly files are due to processing failures in daily file generation.
Data quality flags, data quality metrics, final quality flags, and science review flags are similar to other IS data products, but rather than being embedded within the data tables, they are instead included in the qfqm group within the HDF5 file.
Same as IS data products.
In addition to the typical README and EML files that are provided with all IS data products, the HDF5 files are self describing, and metadata about the file structure and format are embedded within as group and dataset attributes.
Observation Systems (OS) Data
Within each monthly zip package, there may be multiple data tables. The level of granularity for each data table is a type of data collection activity. For example, the Ground Beetles Sampled From Pitfall Traps data product includes individual tables for field data, sorting, initial identification, and later expert identification if needed. A file containing metadata about data validation is also included. The naming convention for OS data files is:
Table names are descriptive. For example, the ground beetle table containing field data from sampling events is named bet_fielddata, while the table with data from sorting the field catch later in the lab is named bet_sorting, and so on for the other tables produced throughout the processing chain.
Data tables are published when they are available. Downloads of recent data may not include tables containing data with longer processing times, such as the identification data for beetles - field data are available much sooner than expert identification. For more details, see the Data Availability page.
NEON OS data are always long with respect to time, but may be long or wide for other variables depending on the needs of different data products. For example, sediment chemistry is provided in long format, with a row in the table for every chemical analyte. To understand the formatting of the tables in each OS data product, consult the Data Product User Guide.
OS data tables can follow any of four publication models, depending on the resolution and specificity of the data. The options are:
- site-date: Data published in each site by month file are data that were collected at that site, during that month. The large majority of OS data tables follow this model; all IS tables follow this model.
- site-all: All available data for a given site are published in every site by month file for the relevant site. This is typically done when contextual data are collected once or infrequently, but are needed to interpret all future data. For example, the trap establishment data for litter traps are published this way.
- lab-all: All available data for a given analytical laboratory are published in every site by month file. This is done for data that are specific to a lab, rather than to a sampling location. For example, many labs provide data from analyses of standards run as unknowns along with samples; these data may be relevant to NEON samples from a wide variety of sites and sampling events.
- lab-current: The most recently ingested data for a given analytical laboratory are published in every site by month file. In this case the appearance to users is essentially the same as in lab-all, although data handling internal to NEON differs slightly.
Recommended practices: To ensure use of the most up-to-date data and avoid duplication, consult the publication date stamp on downloaded files and work with the most recently published file for each lab (for lab files) and for each site (for site-all files). The data stacking function in the neonUtilities package does this automatically, retaining the most recently published files and discarding the others. Information about the publication model for each table can be found in the table_types table in neonUtilities, or on the Data Product Details page for each data product.
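The "keep the most recently published file" rule can be sketched in Python. This simplified example assumes, as in the naming tables above, that GENTIME is the second-to-last dot-separated component of the file name, so its YYYYMMDDTHHmmSSZ form sorts lexicographically by time; the file names in the usage example are illustrative:

```python
def latest_files(filenames):
    """Keep only the newest file per base name among duplicates.

    Files are grouped by their name with the GENTIME component removed;
    within each group the lexicographically greatest GENTIME wins.
    """
    newest = {}
    for name in filenames:
        parts = name.split(".")
        gentime = parts[-2]                        # component before the extension
        base = ".".join(parts[:-2] + parts[-1:])   # name with GENTIME removed
        if base not in newest or gentime > newest[base][0]:
            newest[base] = (gentime, name)
    return [name for _, name in newest.values()]
```

Given two publications of the same site-month table, only the later-generated one is retained, which mirrors what the neonUtilities stacking function does for users automatically.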
Every term is unique; it has a unique name, definition, unit, and data type. For example, the term "decimalLatitude" has the unique definition, "The geographic latitude (in decimal degrees, WGS84) of the geographic center of the reference area", the unit "decimalDegree", and data type "string". Terms may not be used more than once within the same data product table. However, terms may be reused between different data products if the meaning is exactly the same. Wherever that term is used, the definition, unit, and data type are exactly the same. Variable names and their definitions can be found in the variables file in the data downloads.
For categorical variables in OS data products, terms may be used in conjunction with Lists of Values (LOVs), which describe a fixed list of potential values that may be used for a term within a data product. Having terms with pre-defined lists of possible values aids in accurate data entry and also assists end users with standard values.
The name of the LOV for any given term may be found within the categoricalCodeName column of the variables file found in each monthly zip package. All values for all LOVs used may be found in the categoricalCodes file. LOV usage may also be found in the EML file supplied in a data package.
For biological data, some variable names have been standardized with Darwin Core terms, the Global Biodiversity Information Facility vocabularies, and the VegCore data dictionary, where applicable. For genomic data, some field names have been standardized with the Minimum Information about any (x) Sequence (MIxS) along with several of the environmental package extensions.
Free-Text Variables: For free-text fields, few restrictions are applied other than avoiding the use of special characters that may render poorly in some text editors.
Missing Values: In all tabular data, a blank cell indicates data were not collected, or in rare cases, data were lost or redacted. Redacted data are also indicated by a quality flag. Some software, such as Excel or R, may interpret blank values in different ways, most frequently as "NA". Periodically, we may find errors and fill in missing cells on an ad hoc basis.
For some OS data products with long latency times between data collection and publication (due to necessary processing or analytical steps), tables may be initially published with some values missing, and later republished as more data become available. To learn more, read the documentation associated with each data product.
Quality flags in data tables are unique by data product and table. For example, the bet_fielddata table mentioned above contains several variables that can be used to assess the quality of the data - sampleCompromised has values of 'OK', 'handling error', 'damaged, analysis affected', 'sample incomplete', and 'other (described in remarks)'. Use the categoricalCodes files to help assess data quality.
In addition to Science Designs and ATBDs, OS data products are documented by the following document types:
Data Product User Guide : A brief summary of the sampling design and the structure of the published data. In some cases a single User Guide may cover multiple closely related data products.
Protocols: The protocols used by field scientists to carry out sampling and measurements. Protocols are generally modified over time as recommended by our Technical Working Groups or our Science or Field Science staff. Each OS data table will specify which protocol version was used to collect each datum. All versions are available in our Document Library.
In addition to the README and EML files, OS data products are supplied with:
Validation file: Contains validation rules applied to each variable during data entry and ingest. Data entry constraints are described in NEON's Ingest Conversion Language (NICL) syntax. A general description of NICL is available in Nicl_Language.pdf, while each function is more thoroughly described in nicl_function_library.pdf.
Variables file: Contains variable definitions and units for each column in each table of the data product.
Categorical Codes file: Contains the possible list of values for each categorical variable. It includes time stamps for each value, as any given list may change over time.
Airborne Observation Platform (AOP)
The AOP is flown over each site usually no more than once per year. AOP data files are organized by data product, site (sometimes two sites, if they are spatially contiguous), and year of collection. The data portal and API allow for a granular approach to downloading data files, because individual files can exceed tens of gigabytes for some AOP data products. AOP file naming conventions depend on the data product and contain abbreviations unique to AOP. The AOP data files have an additional set of abbreviations in their naming convention:
| Abbreviation | Description |
|---|---|
| FLHTDATE | Date of flight, YYYYMMDD |
| FLIGHTSTRT | Start time of flight, YYYYMMDDHH |
| FLHTSTRT | Start time of flight, YYMMDDHH |
| YYYY | Year of flight |
|  | Date and time of image capture, YYYYMMDDHHmmSS |
|  | Digital camera serial number |
|  | Sequential number for indexing files |
| LNNN | Planned flightline number |
| FFFFFF | Numeric code for an individual flightline |
| EEEEEE | UTM easting of the lower-left corner of the tile, in meters |
| NNNNNNN | UTM northing of the lower-left corner of the tile, in meters |
AOP files are also named using patterns that are different from IS and OS data products (Table 4).
|File Name Structure
|Varies with the year, camera model, and payload - see Table 5
|Discrete return lidar, unclassified
|Discrete return lidar, classified
|NEON_DOM_SITE_DPL_LNNN-R_FLIGHTSTRT_DESC.wvz (or .plz)
|NEON_DOM_SITE_DPL_FLHTDATE_FFFFFF_DESC.zip (or .tif)
|Spectrometer / Lidar / Camera
The filenames of the L1 camera images vary with the year, camera model, and payload. Examples of the image filenames are given in Table 5. The L1 filename gives no indication of the location of the image, but this can be obtained either from the file metadata or from a KMZ file distributed with the files.
Note that the year, month, day, and hour are all encoded in some way in each filename. The serial number of the camera is sometimes included (e.g. "EH021537" and "C0119"). The other numbers are identifiers generated by the camera acquisition software.
The L1 surface reflectance and at-sensor radiance, as well as the L3 surface reflectance image data, are stored in an HDF5 file format (with extension h5) that also includes extensive metadata and data quality information. The HDF5 format was selected because of the flexibility it allows in storing supplementary metadata and data quality information in different formats. However, that flexibility inhibits storage of the data in a standardized structure. As a result, the HDF5 files cannot be read by standard geospatial software packages without a dedicated reader. NEON provides code examples for working with AOP HDF5 data in Python and R as part of our educational data tutorials. All other image data are stored in the OGC standard GeoTIFF format commonly used within the remote sensing community.
The L1 lidar point clouds are stored in LAZ format, a lossless compression of the LAS format, identified by its file extension. LAS is the lidar file exchange format officially adopted by the American Society for Photogrammetry and Remote Sensing (ASPRS), and can be read by all commercial lidar software packages. LAZ can also be read directly by some software packages, but if LAS is required, a conversion tool that decompresses LAZ to LAS can be found at https://rapidlasso.com/lastools/.
The L1 waveform lidar product is stored in a compressed version of the PulseWaves format. In the PulseWaves file format, the data are stored in two different files, with PLS and WVS file extensions. The PLS file stores metadata and geolocation information for each outgoing and returned laser pulse, while the WVS file contains the outgoing and return recorded signals. The compressed files have PLZ and WVZ extensions. A free utility named pulse2pulse, available at https://rapidlasso.com/pulsewaves/, can be used to decompress the PLZ and WVZ files to PLS and WVS.
The L1 and L3 surface reflectance data products provide the image data as values scaled by 10000. To obtain percent reflectance between 0 and 100, the values provided in the image data must be divided by 100; to obtain a ratio between 0 and 1, they must be divided by 10000. In data delivered between 2013 and 2019, the L1 at-sensor radiance image data are provided in floating point. In data after 2019, radiance data are delivered in two separate images: the first contains the integer portion of the at-sensor radiance value, and the second provides the decimal portion scaled by 50000. To obtain the observed at-sensor radiance value, divide the decimal portion by 50000 and add it to the integer portion. All L2 spectrometer data and L3 lidar image data are delivered as floating point values, while L1 and L3 camera data are delivered as 8-bit integers. All images other than camera data therefore store 'no data' values as -9999, which is a convention in the remote sensing community. Since the 8-bit camera images cannot represent -9999, all 'no data' values in camera data are stored as zeros.
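The scaling arithmetic above can be written out as two small helper functions. This is an illustrative sketch of the conversions described in the text, not NEON-provided code:

```python
def reflectance_fraction(scaled_value):
    """Convert a scaled surface reflectance value to a 0-1 fraction.

    L1 and L3 surface reflectance values are stored scaled by 10000;
    dividing by 100 instead yields percent reflectance (0-100).
    """
    return scaled_value / 10000.0

def radiance_from_parts(integer_part, scaled_decimal_part):
    """Recombine post-2019 at-sensor radiance from its two images.

    The first image holds the integer portion; the second holds the
    decimal portion scaled by 50000.
    """
    return integer_part + scaled_decimal_part / 50000.0
```

For example, a stored reflectance value of 5000 corresponds to a reflectance fraction of 0.5 (50%), and an integer image value of 12 paired with a scaled decimal value of 25000 corresponds to a radiance of 12.5 in the product's radiance units.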
Spectrometer products are flagged through a series of ancillary rasters that report quality information about the image data and are delivered within the h5 files. For example, the L3 surface reflectance h5 file includes an ancillary raster describing the observed weather conditions during data collection. This weather quality raster is a three-band color image with values corresponding to 'Green' for 0-10% cloud cover, 'Yellow' for 10-50% cloud cover, and 'Red' for >50% cloud cover. Additional ancillary rasters provided in the h5 files include the solar elevation angle and atmospheric conditions such as water vapor content. In L2 spectrometer products, objects such as man-made surfaces that exhibit high reflectance and cause numerical instabilities in the calculation of vegetation or water indices are identified. Because of these numerical instabilities, we set such values to 'no data', as the anomalous values can cause difficulty when visualizing the data or computing global statistics of the image.
The L1 lidar point cloud contains flags in the classification attribute of the LAS file. Classifications follow the numbering guidelines recommended by the ASPRS LAS file specification. Points that have been identified as 'noise' are given the classification integer 7, and points that could not be definitively classified as 'ground', 'low / medium / high vegetation', 'building', or 'noise' are given the classification 'unclassified', represented by the integer 1. In addition to the LAZ files that contain the standard X,Y,Z coordinate tuples constructing the point cloud, two additional full sets of LAZ files are provided, containing X,Y,errZ and X,Y,errH respectively. Here, the err term refers to a simulated error, with Z representing the vertical coordinate and H the horizontal coordinate. These errors can be used to identify points that may be outside tolerance for applications of interest.
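The classification integers named above follow the ASPRS LAS specification and can be summarized in a small lookup table. This is an illustrative sketch; a real workflow would read the classification attribute from the LAZ file with a lidar library:

```python
# Classification codes from the ASPRS LAS specification, covering the
# classes named in the text above.
ASPRS_CLASSES = {
    1: "unclassified",
    2: "ground",
    3: "low vegetation",
    4: "medium vegetation",
    5: "high vegetation",
    6: "building",
    7: "noise",
}

def describe_class(code):
    """Return a human-readable label for a LAS classification integer."""
    return ASPRS_CLASSES.get(code, f"unrecognized ({code})")
```

For instance, filtering out points whose classification is 7 removes the points flagged as noise before further analysis.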
In addition to science designs and ATBDs, AOP data products are supplied with:
Data Processing Quality Assurance (QA) Document: a summary of the data quality metrics used to assess the validity of the AOP data products, as well as information on flight acquisition parameters, and processing parameters. These documents are optionally delivered with the data products for which they are applicable - they must be selected in order to be downloaded. Look for PDF files in the download workflow.
Goulden, T., 2014. NEON Discrete Lidar Datum Reconciliation Report, NEON.DOC.002293vA
Goulden, T. & Kampe, T. U. NEON AOP Surveys of City of Boulder Pre- and Post-2013 Flood Event.
In addition to the README file, AOP camera data products are supplied with:
KML/KMZ/SHP: Keyhole Markup Language (KML) files, which can be zipped into KMZ files, as well as ESRI shapefiles (SHP), document the boundaries of the flight lines or camera images that were acquired, along with quality information such as observed cloud cover. When camera data are downloaded from NEON, KMZ files are included. KMZ files can be opened in Google Earth Pro and contain L1 and L3 image boundaries and filenames, overlaid on a low-resolution image mosaic; they facilitate locating individual camera images of interest. The KMZ filename is structured as YYYY_SITE_V_mosaic.kmz, e.g., 2019_YELL_2_mosaic.kmz. The file is downloaded to a dedicated Metadata directory within the zipped data package obtained from the portal. Within the KMZ, the low-resolution browse image is shown and the tile boundaries are displayed in red, with a red thumbtack in the middle of each tile; clicking on a thumbtack displays the full filename for that tile. Additionally, KML and SHP files of the 1 km by 1 km mosaic boundaries can be downloaded with the L3 lidar data or L3 camera data. These files can be opened and viewed in Google Earth, ESRI ArcGIS, and other spatial software packages.