Tutorial

Make Training Data for Species Modeling from NEON TOS Vegetation Structure Data

Authors: Bridget Hass

Last Updated: Aug 8, 2025

This notebook demonstrates how to generate a training dataset consisting of tree species, family, and location from the NEON Terrestrial Observation System (TOS) Vegetation Structure data product DP1.10098.001. We will use data from the Smithsonian Environmental Research Center (SERC) site in Maryland. In a subsequent tutorial titled Tree Classification with NEON Airborne Imaging Spectrometer Data using Python xarray, we will use this training dataset to train a random forest machine learning model that predicts tree families from the hyperspectral signatures obtained from the airborne remote sensing data. These two tutorials outline a relatively simple modeling example, and represent a starting point for conducting machine learning analyses using NEON data!

Set Up Python Environment

To run this notebook, you will need the following Python packages, which can be installed using !pip install or !conda install from within the notebook. Note that to use the neonutilities package, you will need Python version 3.9 or higher.

matplotlib
neonutilities
numpy
pandas
requests
seaborn

Create a NEON AOP Token

NEON API Token (optional, but strongly recommended), see NEON API Tokens Tutorial for more details on how to create and set up your token in Python (and R). Once you create your token (on the NEON User Accounts) page, this notebook will show you how to set it as an environment variable and use it for downloading AOP data.

Learning Objectives

Use the neonutilities load_by_product function to read in NEON vegetation structure data at a given site
Use the NEON locations API to determine the geographic position of the vegetation records in UTM x, y coordinates
Filter the datset to include only the latest data and columns of interest
Filter the data geospatially to keep data that are within a single AOP 1 km x 1 km tile

Disclaimer: this notebook is intended to provide an example of how to create an initial training data set for pairing with remote sensing data, and to conduct some exploratory analysis of the vegetation structure data. This does not incorporate outlier detection and removal, or comprehensive pre-processing steps. As part of creating a machine learning model, it is important to assess the training data quality and look for outliers or other potential data quality issues which may impact model results. Refer to the Compare tree height measured from the ground to a Lidar-based Canopy Height Model lesson (the first additional resource above) for more details on how you would address geographic mismatch between the AOP and TOS data.

Additional Resources

The lesson Compare tree height measured from the ground to a Lidar-based Canopy Height Model is another example of linking ground to airborne data, and shows similar steps of pre-processing TOS woody vegetation data.
The paper Individual canopy tree species maps for the National Ecological Observatory Network outlines methods for large-scale classification using NEON data. The associated NEON Science Seminar Harnessing NEON to enable the future of forest remote sensing may be a useful resource. This talk provides a high-level overview of modeling approaches for tree crown delineation and tree classification using NEON airborne remote sensing data. You can also watch the video below.

Refer to the Vegetation Structure User Guide for more details on this data product, and to better understand the data quality flags, the sampling.

1. Download and Explore Vegetation Structure Data (DP1.10098.001)

In this first section we’ll load the vegetation structure data, find the locations of the mapped trees, and join to the species and family observations.

Let's get started! First, import the required Python packages.

import matplotlib.pyplot as plt
import matplotlib.ticker as mticker
import neonutilities as nu
import numpy as np
import pandas as pd
import requests
import seaborn as sns

Set up your NEON token. See the setup instructions at the beginning of the tutorial on how to set up a NEON user account and create a token, if you have not already done so.

# copy and paste your NEON token from your NEON user account page here
my_token=""

We can load the vegetation structure data using the load_by_product function in the neonutilities package (imported as nu). Inputs to the function can be shown by typing help(load_by_product).

Refer to h R neonUtilities cheat sheet or th Python neonutilities documentatione for more details and the completelistx of possible function inputs. The cheat sheet is focused on the R package, but nearly all the inputs are the sae in Python neonutilities.e

Note that in this example, we will pull in all the woody vegetation data (collected over all years), but if you are trying to modeldata collected in a single year, you can select just that year by specifying the startdate and enddate, or later filtering out the vegetation data by the eventID We have set check_size=False since the data are not very large, but to check the size of what the data you are downloading first, you could omit this input, or set it to True.

veg_dict = nu.load_by_product(dpid="DP1.10098.001", 
                              site="SERC", 
                              package="basic", 
                              release="RELEASE-2025",
                              token=my_token,
                              check_size=False)

Finding available files
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 23/23 [00:45<00:00,  1.99s/it]
Downloading 23 NEON DP1.10098.001 files totaling approximately 40.0 MB.
Downloading files
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 23/23 [00:24<00:00,  1.08s/it]
Stacking data files
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:01<00:00,  2.36it/s]

Get a list of the points

veg_map_all = veg_dict["vst_mappingandtagging"]
veg_map = veg_map_all.loc[veg_map_all["pointID"] != ""]
veg_map = veg_map.reindex()
veg_map["points"] = veg_map["namedLocation"] + "." + veg_map["pointID"]
veg_points = list(set(list(veg_map["points"])))

Look at the unique eventIDs. . All sampling at a site that occurs within a given bout is identified by a unique eventID, which represents the date of the bout.

veg_map_all.eventID.unique()

array(['vst_SERC_2015', 'vst_SERC_2016', 'vst_SERC_2017', 'vst_SERC_2018',
       'vst_SERC_2019', 'vst_SERC_2020', 'vst_SERC_2021', 'vst_SERC_2022',
       'vst_SERC_2023'], dtype=object)

Get the number of records for each eventID:

# Group by 'eventID' and get the count
eventID_counts = veg_map_all[['individualID','eventID']].groupby(['eventID']).count()
print("\nCounts of each eventID:\n", eventID_counts)

Counts of each eventID:
                individualID
eventID                    
vst_SERC_2015          1890
vst_SERC_2016          1330
vst_SERC_2017            96
vst_SERC_2018           127
vst_SERC_2019           254
vst_SERC_2020            22
vst_SERC_2021            54
vst_SERC_2022           494
vst_SERC_2023            40

It looks like most of the trees were mapped in 2015 and 2016, which was when the SERC plots were first established. You could look at data only from one year, and compare to AOP data from the same year, or if you are not too worried about matching measurements to remote sensing data collected in the same year, you could use all years. We'll do the latter in this example.

2. Determine the geographic location of the surveyed vegetation

Loop through all of the points in veg_points to determine the easting and northing from the NEON Locations API.

easting = []
northing = []
coord_uncertainty = []
elev_uncertainty = []
for i in veg_points:
    vres = requests.get("https://data.neonscience.org/api/v0/locations/"+i)
    vres_json = vres.json()
    easting.append(vres_json["data"]["locationUtmEasting"])
    northing.append(vres_json["data"]["locationUtmNorthing"])
    props = pd.DataFrame.from_dict(vres_json["data"]["locationProperties"])
    cu = props.loc[props["locationPropertyName"]=="Value for Coordinate uncertainty"]["locationPropertyValue"]
    coord_uncertainty.append(cu[cu.index[0]])
    eu = props.loc[props["locationPropertyName"]=="Value for Elevation uncertainty"]["locationPropertyValue"]
    elev_uncertainty.append(eu[eu.index[0]])

pt_dict = dict(points=veg_points, 
               easting=easting,
               northing=northing,
               coordinateUncertainty=coord_uncertainty,
               elevationUncertainty=elev_uncertainty)

pt_df = pd.DataFrame.from_dict(pt_dict)
pt_df.set_index("points", inplace=True)

veg_map = veg_map.join(pt_df, 
                     on="points", 
                     how="inner")

Next, use the stemDistance and stemAzimuth data to calculate the precise locations of individuals, relative to the reference locations.

$Easting = easting.pointID + stemDistance*sin(\theta)$
$Northing = northing.pointID + stemDistance*cos(\theta)$
$\theta = stemAzimuth*\pi/180$

Also adjust the coordinate and elevation uncertainties.

veg_map["adjEasting"] = (veg_map["easting"]
                        + veg_map["stemDistance"]
                        * np.sin(veg_map["stemAzimuth"]
                                   * np.pi / 180))

veg_map["adjNorthing"] = (veg_map["northing"]
                        + veg_map["stemDistance"]
                        * np.cos(veg_map["stemAzimuth"]
                                   * np.pi / 180))

veg_map["adjCoordinateUncertainty"] = veg_map["coordinateUncertainty"] + 0.6

veg_map["adjElevationUncertainty"] = veg_map["elevationUncertainty"] + 1

Look at the columns to see all the information contained in this dataset.

# look at a subset of the columns that may be relevant
veg_map[['date','individualID','scientificName','taxonID','family','plotID','pointID','adjEasting','adjNorthing']].head(5)

	date	individualID	scientificName	taxonID	family	plotID	pointID	adjEasting	adjNorthing
1	2015-03-23	NEON.PLA.D02.SERC.08038	Carya glabra (Mill.) Sweet	CAGL8	Juglandaceae	SERC_045	43	364809.083993	4.304727e+06
2	2015-03-23	NEON.PLA.D02.SERC.08036	Liriodendron tulipifera L.	LITU	Magnoliaceae	SERC_045	21	364779.130881	4.304717e+06
6	2015-03-23	NEON.PLA.D02.SERC.05664	Quercus montana Willd.	QUMO4	Fagaceae	SERC_062	59	364857.704931	4.304826e+06
9	2015-03-23	NEON.PLA.D02.SERC.08009	Acer rubrum L.	ACRU	Aceraceae	SERC_045	23	364795.325837	4.304706e+06
10	2015-03-23	NEON.PLA.D02.SERC.08016	Acer rubrum L.	ACRU	Aceraceae	SERC_045	23	364790.181050	4.304714e+06

len(veg_map)

3. Filter to trees within an AOP tile extent

Now create a new dataframe containing only the veg data that are within a single AOP tile (which are 1 km x 1 km in size). For this, you will need to know the bounds (minimum and maximum UTM easting and northing) of the area you are sampling. For this exercise, we will choose the AOP data with SW (lower left) UTM coordinates of 364000, 4305000. This tile encompasses the NEON tower at the SERC site.

veg_tower_tile = veg_map[(veg_map['adjEasting'].between(364000, 365000)) & (veg_map['adjNorthing'].between(4305000, 4306000))]

How many records do we have within this tile?

len(veg_tower_tile)

There are 211 unique vegetation records in this area. We can also look at the unique taxonIDs that are represented.

# look at the unique Taxon IDs
veg_tower_tile.taxonID.unique()

array(['LIST2', 'ACRU', 'FAGR', 'ULMUS', 'LITU', 'CACA18', 'NYSY', 'QUFA',
       'QURU', 'QUAL', 'CAGL8', 'CATO6', 'QUVE', 'COFL2', 'QUERC',
       'QUMA3', 'PRAV', 'PINUS'], dtype=object)

Let's keep only a subset of the columns that we are interested in, and look at the dataframe:

veg_tower_tile_short = veg_tower_tile[['date','individualID','scientificName','taxonID','family','plotID','pointID','adjEasting','adjNorthing']]
veg_tower_tile_short.reset_index(drop=True, inplace=True)
veg_tower_tile_short

	date	individualID	scientificName	taxonID	family	plotID	pointID	adjEasting	adjNorthing
0	2015-03-25	NEON.PLA.D02.SERC.05537	Liquidambar styraciflua L.	LIST2	Hamamelidaceae	SERC_052	59	364578.230724	4.305896e+06
1	2015-03-25	NEON.PLA.D02.SERC.05546	Acer rubrum L.	ACRU	Aceraceae	SERC_052	61	364590.549216	4.305899e+06
2	2015-03-25	NEON.PLA.D02.SERC.05548	Fagus grandifolia Ehrh.	FAGR	Fagaceae	SERC_052	43	364586.368495	4.305883e+06
3	2015-03-25	NEON.PLA.D02.SERC.05549	Acer rubrum L.	ACRU	Aceraceae	SERC_052	59	364578.883395	4.305888e+06
4	2015-03-25	NEON.PLA.D02.SERC.05545	Acer rubrum L.	ACRU	Aceraceae	SERC_052	61	364590.276108	4.305897e+06
...	...	...	...	...	...	...	...	...	...
206	2023-01-10	NEON.PLA.D02.SERC.09712	Carya tomentosa (Lam.) Nutt.	CATO6	Juglandaceae	SERC_056	43	364708.570635	4.305389e+06
207	2023-01-11	NEON.PLA.D02.SERC.09708	Pinus sp.	PINUS	Pinaceae	SERC_056	59	364701.345113	4.305407e+06
208	2023-01-16	NEON.PLA.D02.SERC.09358	Pinus sp.	PINUS	Pinaceae	SERC_057	41	364438.131683	4.305409e+06
209	2023-01-16	NEON.PLA.D02.SERC.14563	Quercus alba L.	QUAL	Fagaceae	SERC_057	43	364470.101669	4.305412e+06
210	2023-01-16	NEON.PLA.D02.SERC.14548	Fagus grandifolia Ehrh.	FAGR	Fagaceae	SERC_057	41	364455.772024	4.305415e+06

211 rows × 9 columns

To get a better sense of the data, we can also look at the # of each species, to see if some species have more representation than others.

# display the taxonID counts, sorted descending
veg_tower_tile_taxon_counts = veg_tower_tile[['individualID','taxonID']].groupby(['taxonID']).count()
veg_tower_tile_taxon_counts.sort_values(by='individualID',ascending=False)

	individualID
taxonID
FAGR	48
LITU	35
LIST2	29
ACRU	16
CAGL8	12
QUAL	11
CATO6	11
CACA18	11
NYSY	10
QUFA	9
ULMUS	5
QURU	4
COFL2	3
QUVE	2
PINUS	2
PRAV	1
QUERC	1
QUMA3	1

# display the family counts, sorted descending
veg_tower_tile_family_counts = veg_tower_tile[['individualID','family']].groupby(['family']).count()
veg_tower_tile_family_counts.sort_values(by='individualID',ascending=False)

	individualID
family
Fagaceae	76
Magnoliaceae	35
Hamamelidaceae	29
Juglandaceae	23
Aceraceae	16
Cornaceae	13
Betulaceae	11
Ulmaceae	5
Pinaceae	2
Rosaceae	1

It looks like there are a number of different species (and families) mapped in this tower plot. You can use the https://plants.usda.gov website to look up the species information. The top 5 most abundant mapped species are linked below.

FAGR: American Beech (Fagus grandifolia Ehrh.)
LITU: Tuliptree (Liriodendron tulipifera L.)
LIST2: Sweetgum (Liquidambar styraciflua L.)
ACRU: Red Maple (Acer rubrum L.)
CAGL8: Sweet pignut hickory (Carya glabra (Mill.))

When carrying out classification, the species that only have small representation (1-5 samples) may not be modeled accurately due to a lack of sufficient training data. The challenge of mapping rarer species due to insufficient training data is well known. In the next tutorial, we will remove these poorly represented samples before generating a model.

4. Write training dataframe to csv file

Nonetheless, we have a fairly decent training dataset to work with. We can save the dataframe to a csv file called serc_training_data.csv as follows:

veg_tower_tile_short.to_csv(r'.\data\serc_training_data.csv',index=False)

5. Plot tree families in map view

Finally, we can make a quick plot using seaborn (imported as sns) to show the spatial distrubtion of the trees surveyed in this area, along with their species (scientificName). Most of this code helps improve the formatting and appearance of the figure; the first sns.scatterplot chunk is all you really need to do to plot the essentials.

ax = sns.scatterplot(
    data=veg_tower_tile_short,
    x='adjEasting',
    y='adjNorthing',
    hue='family',
)

# Make the x and y dimensions are equal
ax.set_aspect('equal', adjustable='box')

# Remove scientific notation on the x and y axes labels
ax.xaxis.set_major_formatter(mticker.FuncFormatter(lambda x, _: f'{x:.0f}'))
ax.yaxis.set_major_formatter(mticker.FuncFormatter(lambda y, _: f'{y:.0f}'))

# Place the legend outside the plot at the center right
plt.legend(loc='center left', bbox_to_anchor=(1.05, 0.5))

# Adjust layout to prevent legend overlap
plt.tight_layout()

# Add title and axis labels
ax.set_title("SERC Tree Families", fontsize=14)
ax.set_xlabel("Easting (m)", fontsize=12)
ax.set_ylabel("Northing (m)", fontsize=12)
plt.yticks(fontsize=8)  
plt.xticks(fontsize=8)  

plt.show()

png

Great! We can see all the trees that were surveyed in this AOP tile. The trees are sampled in discrete plots. For more information about the TOS sampling design, please refer to the Vegetation structure data product page.

Recap

In this lesson, we have created a training data set containing information about the tree family and species as well as their geographic locations in UTM x, y coordinates. We can now pair this training data set with the remote sensing data and create a model to predict the tree's family based off airborne spectral data. The next tutorial, Tree Classification with NEON Airborne Imaging Spectrometer Data using Python xarray, will show how to do this!

Note: you may wish to create a training dataframe that contains additional information about the trees. For example, you can also include parameters like the growth form (e.g. whether the vegetation is a shrub, single-bole or multi-bole tree, etc.), the plant status (whether the tree is healthy or standing dead), and measurements such as the stem diameter and tree height. To do this, you would need to join the vst_mappingandtagging table with the vst_apparentindividual tables. Refer to the Quick Start Guide for Vegetation Structure for more information about the data tables and the joining instructions. You can also refer to the lesson Compare tree height measured from the ground to a Lidar-based Canopy Height Model which provides an example of how to do this and compare the TOS measured data with the AOP Lidar-derived Canopy Height Model (Ecosystem Structure) data product.

Get Lesson Code

make_classification_training_data.ipynb