Series
Get Started with NEON Data: A Series of Data Tutorials
This Data Tutorial Series is designed to introduce you to accessing and using NEON data. It includes foundational skills for working with NEON data, as well as tutorials focused on specific data types, which you can choose from based on your interests.
Foundational Skills and Tools to Access NEON Data
- Start with a short video guide to downloading data from the NEON Data Portal.
- The Download and Explore NEON Data tutorial will guide you through using the neonUtilities package in R and/or the neonutilities package in Python, to download and transform NEON data and to use the metadata that accompany data downloads to help you understand the data.
- Many more details about the neonUtilities package, and the input parameters available for its functions, are available in the neonUtilities cheat sheet.
- Using an API token can make your downloads faster, and helps NEON by linking your user account to your downloads. See more information about API tokens here, and learn how to use a token with neonUtilities in this tutorial.
- NEON data are initially published as Provisional and are subject to change; each year NEON publishes a data Release that does not change and can be cited by DOIs. Learn about working with Releases and Provisional data.
- Learn how to work with NEON location data, using examples from vegetation structure observations and soil temperature sensors.
Introductions to Working with Different Data Types
- Explore the intersection of sensor and observational data with the Plant Phenology & Temperature tutorial series (individual tutorials that make up the series are listed in the sidebar). This is also a good introduction for inexperienced R users.
- Get familiar with NEON sensor data flagging and data quality metrics, using aquatic instrument data as exemplar datasets.
- Use the neonOS package to join tables and flag duplicates in NEON observational data.
- Calculate biodiversity metrics from NEON aquatic macroinvertebrate data.
- For a quick introduction to working with remote sensing data, calculate a canopy height model from discrete return Lidar. NEON has an extensive catalog of tutorials about remote sensing principles and data; search the tutorials and tutorial series if you are interested in other topics.
- Connecting ground observations to remote sensing imagery is important to many NEON users; get familiar with the process, as well as some of the challenges of comparing these data sources, by comparing tree height observations to a canopy height model.
- Use the neonUtilities package to wrangle NEON surface-atmosphere exchange data (published in HDF5 format).
Download and Explore NEON Data
Authors: Claire K. Lunch
Last Updated: Oct 1, 2024
This tutorial covers downloading NEON data, using the Data Portal and either the neonUtilities R package or the neonutilities Python package, as well as basic instruction in beginning to explore and work with the downloaded data, including guidance in navigating data documentation. We will explore data of 3 different types, and make a simple figure from each.
NEON data
There are 3 basic categories of NEON data:
- Remote sensing (AOP) - Data collected by the airborne observation platform, e.g. LIDAR, surface reflectance
- Observational (OS) - Data collected by a human in the field, or in an analytical laboratory, e.g. beetle identification, foliar isotopes
- Instrumentation (IS) - Data collected by an automated, streaming sensor, e.g. net radiation, soil carbon dioxide. This category also includes the surface-atmosphere exchange (SAE) data, which are processed and structured in a unique way, distinct from other instrumentation data (see the introductory eddy flux data tutorial for details).
This lesson covers all three types of data. The download procedures are similar for all types, but data navigation differs significantly by type.
Objectives
After completing this activity, you will be able to:
- Download NEON data using the neonUtilities package.
- Understand downloaded data sets and load them into R or Python for analyses.
Things You’ll Need To Complete This Tutorial
You can follow either the R or Python code throughout this tutorial.
- For R users, we recommend using R version 4+ and RStudio.
- For Python users, we recommend using Python 3.9+.
Set up: Install Packages
Packages only need to be installed once; you can skip this step after the first time:
R
- neonUtilities: Basic functions for accessing NEON data
- neonOS: Functions for common data wrangling needs for NEON observational data.
- terra: Spatial data package; needed for working with remote sensing data.
install.packages("neonUtilities")
install.packages("neonOS")
install.packages("terra")
Python
- neonutilities: Basic functions for accessing NEON data
- rasterio: Spatial data package; needed for working with remote sensing data.
pip install neonutilities
pip install rasterio
Additional Resources
- GitHub repository for neonUtilities
- neonUtilities cheat sheet. A quick reference guide for users.
Set up: Load packages
R
library(neonUtilities)
library(neonOS)
library(terra)
Python
import neonutilities as nu
import os
import rasterio
import pandas as pd
import matplotlib.pyplot as plt
Getting started: Download data from the Portal
Go to the NEON Data Portal and download some data! To follow the tutorial exactly, download Photosynthetically active radiation (PAR) (DP1.00024.001) data from September-November 2019 at Wind River Experimental Forest (WREF). The downloaded file should be a zip file named NEON_par.zip.
If you prefer to explore a different data product, you can still follow this tutorial. But it will be easier to understand the steps in the tutorial, particularly the data navigation, if you choose a sensor data product for this section.
Once you’ve downloaded a zip file of data from the portal, switch over to R or Python to proceed with coding.
Stack the downloaded data files: stackByTable()
The stackByTable() (or stack_by_table()) function will unzip and join the files in the downloaded zip file.
R
# Modify the file path to match the path to your zip file
stackByTable("~/Downloads/NEON_par.zip")
Python
# Modify the file path to match the path to your zip file
nu.stack_by_table(os.path.expanduser("~/Downloads/NEON_par.zip"))
In the directory where the zipped file was saved, you should now have an unzipped folder of the same name. When you open this you will see a new folder called stackedFiles, which should contain at least seven files: PARPAR_30min.csv, PARPAR_1min.csv, sensor_positions.csv, variables_00024.csv, readme_00024.txt, issueLog_00024.csv, and citation_00024_RELEASE-202X.txt.
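To confirm the stacking worked, you can list the contents of the stackedFiles folder; a quick check in Python (the path assumes the zip file was saved to ~/Downloads):
# List the stacked files to confirm the unzip-and-stack step completed
stacked_dir = os.path.expanduser("~/Downloads/NEON_par/stackedFiles")
print(sorted(os.listdir(stacked_dir)))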
Navigate data downloads: IS
Let’s start with a brief description of each file. This set of files is typical of a NEON IS data product.
- PARPAR_30min.csv: PAR data at 30-minute averaging intervals
- PARPAR_1min.csv: PAR data at 1-minute averaging intervals
- sensor_positions.csv: The physical location of each sensor collecting PAR measurements. There is a PAR sensor at each level of the WREF tower, and this table lets you connect the tower level index to the height of the sensor in meters.
- variables_00024.csv: Definitions and units for each data field in the PARPAR_#min tables.
- readme_00024.txt: Basic information about the PAR data product.
- issueLog_00024.csv: A record of known issues associated with PAR data.
- citation_00024_RELEASE-202X.txt: The citation to use when you publish a paper using these data, in BibTeX format.
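For a quick look at how the tower level indices relate to sensor heights, you can read the sensor positions file directly; a minimal Python sketch (we simply print the columns, since the exact field names can vary between data products):
# Read the sensor positions file and inspect its structure; the offset
# columns relate each tower level index to a height in meters
pos = pd.read_csv(os.path.expanduser(
    "~/Downloads/NEON_par/stackedFiles/sensor_positions.csv"))
print(pos.columns.tolist())
print(pos.head())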
We’ll explore the 30-minute data. To read the file, use the function readTableNEON() or read_table_neon(), which uses the variables file to assign data types to each column of data:
R
par30 <- readTableNEON(
dataFile="~/Downloads/NEON_par/stackedFiles/PARPAR_30min.csv",
varFile="~/Downloads/NEON_par/stackedFiles/variables_00024.csv")
head(par30)
Python
par30 = nu.read_table_neon(
data_file=os.path.expanduser(
"~/Downloads/NEON_par/stackedFiles/PARPAR_30min.csv"),
var_file=os.path.expanduser(
"~/Downloads/NEON_par/stackedFiles/variables_00024.csv"))
# Open the par30 table in the table viewer of your choice
The first four columns are added by stackByTable() when it merges files across sites, months, and tower heights. The column publicationDate is the date-time stamp indicating when the data were published, and the release column indicates which NEON data release the data belong to. For more information about NEON data releases, see the Data Product Revisions and Releases page.
Information about each data column can be found in the variables file, where you can see definitions and units for each column of data.
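For example, to see the definition and units of the PARMean field used in the next section, you can filter the variables file (a Python sketch; fieldName, description, and units are the usual column names in NEON variables files):
# Look up the PARMean field in the variables file
vars24 = pd.read_csv(os.path.expanduser(
    "~/Downloads/NEON_par/stackedFiles/variables_00024.csv"))
print(vars24.loc[vars24.fieldName == "PARMean",
                 ["table", "fieldName", "description", "units"]])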
Plot PAR data
Now that we know what we’re looking at, let’s plot PAR from the top tower level. We’ll use the mean PAR from each averaging interval, and we can see from the sensor positions file that the vertical index 080 corresponds to the highest tower level. To explore the sensor positions data in more depth, see the spatial data tutorial.
R
plot(PARMean~endDateTime,
data=par30[which(par30$verticalPosition=="080"),],
type="l")
Python
par30top = par30[par30.verticalPosition=="080"]
fig, ax = plt.subplots()
ax.plot(par30top.endDateTime, par30top.PARMean)
plt.show()
Looks good! The sun comes up and goes down every day, and some days are cloudy.
Plot more PAR data
To see another layer of data, add PAR from a lower tower level to the plot.
R
plot(PARMean~endDateTime,
data=par30[which(par30$verticalPosition=="080"),],
type="l")
lines(PARMean~endDateTime,
data=par30[which(par30$verticalPosition=="020"),],
col="blue")
Python
par30low = par30[par30.verticalPosition=="020"]
fig, ax = plt.subplots()
ax.plot(par30top.endDateTime, par30top.PARMean)
ax.plot(par30low.endDateTime, par30low.PARMean)
plt.show()
We can see there is a lot of light attenuation through the canopy.
Download files and load directly to R: loadByProduct()
At the start of this tutorial, we downloaded data from the NEON Data Portal. NEON also provides an API, and the neonUtilities and neonutilities packages provide methods for downloading from it programmatically. The steps we carried out above - downloading from the portal, stacking the downloaded files, and reading in to R or Python - can all be carried out in one step by the neonUtilities function loadByProduct(). To get the same PAR data we worked with above, we would run the following:
R
parlist <- loadByProduct(dpID="DP1.00024.001",
site="WREF",
startdate="2019-09",
enddate="2019-11")
Python
parlist = nu.load_by_product(dpid="DP1.00024.001",
site="WREF",
startdate="2019-09",
enddate="2019-11")
Explore loaded data
The object returned by loadByProduct() in R is a named list, and the object returned by load_by_product() in Python is a dictionary. The objects contained in the list or dictionary are the same set of tables we ended with after stacking the data from the portal above. You can see this by checking the names of the tables in parlist:
R
names(parlist)
## [1] "citation_00024_RELEASE-2024" "issueLog_00024"
## [3] "PARPAR_1min" "PARPAR_30min"
## [5] "readme_00024" "sensor_positions_00024"
## [7] "variables_00024"
Python
parlist.keys()
## dict_keys(['PARPAR_1min', 'PARPAR_30min', 'citation_00024_RELEASE-2024', 'issueLog_00024', 'readme_00024', 'sensor_positions_00024', 'variables_00024'])
Now let’s walk through the details of the inputs and options in loadByProduct().
This function downloads data from the NEON API, merges the site-by-month files, and loads the resulting data tables into the programming environment, assigning each data type to the appropriate class. This is a popular choice for NEON data users because it ensures you’re always working with the latest data, and it ends with ready-to-use tables. However, if you use it in a workflow you run repeatedly, keep in mind it will re-download the data every time. See below for suggestions on saving the data locally to save time and compute resources.
loadByProduct()
works on most observational (OS) and
sensor (IS) data, but not on surface-atmosphere exchange (SAE) data,
remote sensing (AOP) data, and some of the data tables in the microbial
data products. For functions that download AOP data, see the final
section in this tutorial. For functions that work with SAE data, see the
NEON
eddy flux data tutorial.
The inputs to loadByProduct() control which data to download and how to manage the processing:
- dpID: the data product ID, e.g. DP1.00002.001
- site: defaults to "all", meaning all sites with available data; can be a vector of 4-letter NEON site codes, e.g. c("HARV","CPER","ABBY") (or ["HARV","CPER","ABBY"] in Python)
- startdate and enddate: default to NA, meaning all dates with available data; or a date in the form YYYY-MM, e.g. 2017-06. Since NEON data are provided in month packages, finer scale querying is not available. Both start and end dates are inclusive.
- package: either the basic or expanded data package. Expanded data packages generally include additional information about data quality, such as chemical standards and quality flags. Not every data product has an expanded package; if the expanded package is requested but there isn't one, the basic package will be downloaded.
- timeIndex: defaults to "all", to download all data; or the number of minutes in the averaging interval. Only applicable to IS data.
- release: specify a NEON data release to download. Defaults to the most recent release plus provisional data. See the release tutorial for more information.
- include.provisional: T or F: should Provisional data be included in the download? Defaults to F, returning only Released data, which are citable by a DOI and do not change over time. Provisional data are subject to change.
- check.size: T or F: should the function pause before downloading data and warn you about the size of your download? Defaults to T; if you are using this function within a script or batch process you will want to set it to F.
- token: optional NEON API token for faster downloads. See this tutorial for instructions on using a token.
- nCores: number of cores to use for parallel processing. Defaults to 1, i.e. no parallelization. Only available in R.
- forceParallel: if the data volume to be processed does not meet minimum requirements to run in parallel, this overrides that check. Only available in R.
- progress: set to False to turn off the progress bar. Only available in Python.
- cloud_mode: can be set to True if you are working in a cloud environment; enables more efficient data transfer from NEON's cloud storage. Only available in Python.
The dpID is the data product identifier of the data you want to download. The DPID can be found on the Explore Data Products page. It will be in the form DP#.#####.###.
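To illustrate how several of these options fit together, here is a hedged Python sketch that re-downloads the PAR data while skipping the size prompt, including provisional data, and turning off the progress bar. The parameter name include_provisional is an assumption, mirroring the R argument include.provisional; check the neonutilities documentation for the exact name:
# Combine several download options in one call. include_provisional is an
# assumed parameter name mirroring R's include.provisional
parprov = nu.load_by_product(dpid="DP1.00024.001",
                             site="WREF",
                             startdate="2019-09",
                             enddate="2019-11",
                             check_size=False,
                             include_provisional=True,
                             progress=False)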
Download observational data
To explore observational data, we’ll download aquatic plant chemistry data (DP1.20063.001) from three lake sites: Prairie Lake (PRLA), Suggs Lake (SUGG), and Toolik Lake (TOOK).
R
apchem <- loadByProduct(dpID="DP1.20063.001",
site=c("PRLA","SUGG","TOOK"),
package="expanded",
release="RELEASE-2024",
check.size=F)
Python
apchem = nu.load_by_product(dpid="DP1.20063.001",
site=["PRLA", "SUGG", "TOOK"],
package="expanded",
release="RELEASE-2024",
check_size=False)
Navigate data downloads: OS
As we saw above, the object returned by loadByProduct() is a named list of data frames. Let’s check out what’s the same and what’s different from the IS data tables.
R
names(apchem)
## [1] "apl_biomass" "apl_clipHarvest"
## [3] "apl_plantExternalLabDataPerSample" "apl_plantExternalLabQA"
## [5] "asi_externalLabPOMSummaryData" "categoricalCodes_20063"
## [7] "citation_20063_RELEASE-2024" "issueLog_20063"
## [9] "readme_20063" "validation_20063"
## [11] "variables_20063"
Python
apchem.keys()
## dict_keys(['apl_biomass', 'apl_clipHarvest', 'apl_plantExternalLabDataPerSample', 'apl_plantExternalLabQA', 'asi_externalLabPOMSummaryData', 'categoricalCodes_20063', 'citation_20063_RELEASE-2024', 'issueLog_20063', 'readme_20063', 'validation_20063', 'variables_20063'])
Explore tables
As with the sensor data, we have some data tables and some metadata tables. Most of the metadata files are the same as for the sensor data: readme, variables, issueLog, and citation. These files contain the same types of metadata that they did in the IS data product. Let’s look at the other files:
- apl_clipHarvest: Data from the clip harvest collection of aquatic plants
- apl_biomass: Biomass data from the collected plants
- apl_plantExternalLabDataPerSample: Chemistry data from the collected plants
- apl_plantExternalLabQA: Quality assurance data from the chemistry analyses
- asi_externalLabPOMSummaryData: Quality metrics from the chemistry lab
- validation_20063: For observational data, a major method for ensuring data quality is to control data entry. This file contains information about the data ingest rules applied to each input data field.
- categoricalCodes_20063: Definitions of each value for categorical data, such as growth form and sample condition
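Before digging in, it can help to get a feel for how much data each table holds; a quick Python loop over the downloaded dictionary:
# Print the dimensions of each data frame in the downloaded dictionary
for name, tab in apchem.items():
    if isinstance(tab, pd.DataFrame):
        print(name, tab.shape)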
You can work with these tables from the named list object, but many people find it easier to extract each table from the list and work with it as an independent object. To do this, use the list2env() function in R or globals().update() in Python:
R
list2env(apchem, .GlobalEnv)
## <environment: R_GlobalEnv>
Python
globals().update(apchem)
Save data locally
Keep in mind that using loadByProduct() will re-download the data every time you run your code. In some cases this may be desirable, but it can be a waste of time and compute resources. To come back to these data without re-downloading, you’ll want to save the tables locally. The most efficient option is to save the named list in total.
R
saveRDS(apchem,
"~/Downloads/aqu_plant_chem.rds")
Python
# There are a variety of ways to do this in Python; NEON
# doesn't currently have a specific recommendation. If
# you don't have a data-saving workflow you already use,
# we suggest you check out the pickle module.
Then you can re-load the object to a programming environment any time.
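For example, a minimal pickle-based sketch that saves the dictionary and reads it back in a later session (the file path is just an example):
import pickle

# Save the full dictionary of tables to a single local file
with open(os.path.expanduser("~/Downloads/aqu_plant_chem.pkl"), "wb") as f:
    pickle.dump(apchem, f)

# Later, reload it without re-downloading from the NEON API
with open(os.path.expanduser("~/Downloads/aqu_plant_chem.pkl"), "rb") as f:
    apchem = pickle.load(f)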
Other options for saving data locally:
- Similar to the workflow we started this tutorial with, but using neonUtilities to download instead of the Portal: use zipsByProduct() and stackByTable() instead of loadByProduct(). With this option, use the function readTableNEON() to read the files, to get the same column type assignment that loadByProduct() carries out. Details can be found in our neonUtilities tutorial. (A Python sketch of this workflow follows this list.)
- Try out the community-developed neonstore package, which is designed for maintaining a local store of the NEON data you use. The neonUtilities function stackFromStore() works with files downloaded by neonstore. See the neonstore tutorial for more information.
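For reference, a rough Python sketch of the first option above, assuming the neonutilities functions zips_by_product() and stack_by_table() mirror their R counterparts (the filesToStack folder name follows the R package's convention and is an assumption here):
# Download the site-by-month zip files to a local folder...
nu.zips_by_product(dpid="DP1.20063.001",
                   site=["PRLA", "SUGG", "TOOK"],
                   package="expanded",
                   savepath=os.path.expanduser("~/Downloads"),
                   check_size=False)

# ...then unzip and stack them into one table per data table.
# The folder name is assumed; check what zips_by_product() created.
nu.stack_by_table(os.path.expanduser("~/Downloads/filesToStack20063"))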
Now let’s explore the aquatic plant data. OS data products are simple in that the data are generally tabular and data volumes are lower than for the other NEON data types, but they are complex in that almost all consist of multiple tables containing information collected at different times and in different ways. For example, samples collected in the field may be shipped to a laboratory for analysis. Data associated with the field collection will appear in one data table, and the analytical results will appear in another. Complexity in working with OS data usually involves bringing data together from multiple measurements or scales of analysis.
As with the IS data, the variables file can tell you more about the data tables.
OS data products each come with a Data Product User Guide, which can be downloaded with the data, or accessed from the document library on the Data Portal, or the Product Details page for the data product. The User Guide is designed to give a basic introduction to the data product, including a brief summary of the protocol and descriptions of data format and structure.
Explore isotope data
To get started with the aquatic plant chemistry data, let’s take a look at carbon isotope ratios in plants across the three sites we downloaded. The chemical analytes are reported in the apl_plantExternalLabDataPerSample table, and the table is in long format, with one record per sample per analyte, so we’ll subset to only the carbon isotope analyte:
R
boxplot(analyteConcentration~siteID,
data=apl_plantExternalLabDataPerSample,
subset=analyte=="d13C",
xlab="Site", ylab="d13C")
Python
apl13C = apl_plantExternalLabDataPerSample[
apl_plantExternalLabDataPerSample.analyte=="d13C"]
grouped = apl13C.groupby("siteID")["analyteConcentration"]
fig, ax = plt.subplots()
ax.boxplot(x=[group.values for name, group in grouped],
tick_labels=grouped.groups.keys())
plt.show()
We see plants at Suggs and Toolik are quite low in 13C, with more spread at Toolik than Suggs, and plants at Prairie Lake are relatively enriched. Clearly the next question is what species these data represent. But taxonomic data aren’t present in the apl_plantExternalLabDataPerSample table; they’re in the apl_biomass table. We’ll need to join the two tables to get chemistry by taxon.
Every NEON data product has a Quick Start Guide (QSG), and for OS products it includes a section describing how to join the tables in the data product. Since it’s a pdf file, loadByProduct() doesn’t bring it in, but you can view the Aquatic plant chemistry QSG on the Product Details page. The neonOS package uses the information from the QSGs to provide an automated table-joining function, joinTableNEON().
Explore isotope data by species
R
apct <- joinTableNEON(apl_biomass,
apl_plantExternalLabDataPerSample)
Using the merged data, we can now plot the carbon isotope ratio for each taxon.
boxplot(analyteConcentration~scientificName,
data=apct, subset=analyte=="d13C",
xlab=NA, ylab="d13C",
las=2, cex.axis=0.7)
Python
There is not yet an equivalent to the neonOS package in Python, so we will code the table join manually, based on the info in the Quick Start Guide:
apct = pd.merge(apl_biomass,
apl_plantExternalLabDataPerSample,
left_on=["siteID", "chemSubsampleID"],
right_on=["siteID", "sampleID"],
how="outer")
Using the merged data, we can now plot the carbon isotope ratio for each taxon.
apl13Cspp = apct[apct.analyte=="d13C"]
grouped = apl13Cspp.groupby("scientificName")["analyteConcentration"]
fig, ax = plt.subplots()
ax.boxplot(x=[group.values for name, group in grouped],
           tick_labels=grouped.groups.keys())
plt.show()
And now we can see most of the sampled plants have carbon isotope ratios around -30, with just a few species accounting for most of the more enriched samples.
Download remote sensing data: byFileAOP() and byTileAOP()
Remote sensing data files are very large, so downloading them can take a long time. byFileAOP() and byTileAOP() enable easier programmatic downloads, but be aware it can take a very long time to download large amounts of data.
Input options for the AOP functions are:
- dpID: the data product ID, e.g. DP1.00002.001
- site: the 4-letter code of a single site, e.g. HARV
- year: the 4-digit year to download
- savepath: the file path you want to download to; defaults to the working directory
- check.size: T or F: should the function pause before downloading data and warn you about the size of your download? Defaults to T; if you are using this function within a script or batch process you will want to set it to F.
- easting: byTileAOP() only. Vector of easting UTM coordinates whose corresponding tiles you want to download
- northing: byTileAOP() only. Vector of northing UTM coordinates whose corresponding tiles you want to download
- buffer: byTileAOP() only. Size in meters of the buffer to include around the coordinates when deciding which tiles to download
- token: optional NEON API token for faster downloads.
- chunk_size: only in Python. Sets the chunk size of chunked downloads; can improve efficiency in some cases. Defaults to 1 MB.
Here, we’ll download one tile of Ecosystem structure (Canopy Height Model) (DP3.30015.001) from WREF in 2017.
R
byTileAOP(dpID="DP3.30015.001", site="WREF",
year=2017,easting=580000,
northing=5075000,
savepath="~/Downloads")
Python
nu.by_tile_aop(dpid="DP3.30015.001", site="WREF",
year=2017,easting=580000,
northing=5075000,
savepath=os.path.expanduser(
"~/Downloads"))
In the directory indicated in savepath, you should now have a folder named DP3.30015.001 with several nested subfolders, leading to a tif file containing the canopy height model tile.
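Once you locate the tif, you can open it with the rasterio package we imported earlier; a minimal sketch for a quick look (the file path below is a placeholder; substitute the actual path to the tile you downloaded):
import rasterio
from rasterio.plot import show

# Placeholder path; substitute the actual path to the downloaded CHM tif
chm_path = os.path.expanduser(
    "~/Downloads/DP3.30015.001/path/to/CHM_tile.tif")
with rasterio.open(chm_path) as chm:
    show(chm)  # quick-look plot of the canopy height model tile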