Make Training Data for Species Modeling from NEON TOS Vegetation Structure Data

Authors: Bridget Hass

Last Updated: Aug 8, 2025

This notebook demonstrates how to generate a training dataset consisting of tree species, family, and location from the NEON Terrestrial Observation System (TOS) Vegetation Structure data product DP1.10098.001. We will use data from the Smithsonian Environmental Research Center (SERC) site in Maryland. In a subsequent tutorial titled Tree Classification with NEON Airborne Imaging Spectrometer Data using Python xarray, we will use this training dataset to train a random forest machine learning model that predicts tree families from the hyperspectral signatures obtained from the airborne remote sensing data. These two tutorials outline a relatively simple modeling example, and represent a starting point for conducting machine learning analyses using NEON data!

Set Up Python Environment

To run this notebook, you will need the following Python packages, which can be installed using !pip install or !conda install from within the notebook. Note that to use the neonutilities package, you will need Python version 3.9 or higher.

  • matplotlib
  • neonutilities
  • numpy
  • pandas
  • requests
  • seaborn

Create a NEON API Token

  • NEON API Token (optional, but strongly recommended): see the NEON API Tokens Tutorial for more details on how to create and set up your token in Python (and R). Once you create your token (on the NEON User Accounts page), you can store it as an environment variable or paste it into the notebook where indicated, and use it when downloading NEON data.
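One way to keep the token out of the notebook itself is to read it from an environment variable. This is a minimal sketch; the variable name NEON_TOKEN and the helper get_neon_token are assumptions chosen for illustration, not names required by neonutilities.

```python
import os

# "NEON_TOKEN" is a hypothetical variable name chosen for this sketch
TOKEN_VAR = "NEON_TOKEN"

def get_neon_token(var_name=TOKEN_VAR):
    """Return the token stored in the environment, or "" if it is not set."""
    return os.environ.get(var_name, "")

my_token = get_neon_token()
if not my_token:
    print(f"{TOKEN_VAR} is not set; requests will be made without a token.")
```

The resulting my_token string can be passed to the token input of load_by_product in the same way as a pasted token.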

Learning Objectives

  • Use the neonutilities load_by_product function to read in NEON vegetation structure data at a given site
  • Use the NEON locations API to determine the geographic position of the vegetation records in UTM x, y coordinates
  • Filter the dataset to include only the latest data and columns of interest
  • Filter the data geospatially to keep data that are within a single AOP 1 km x 1 km tile

Disclaimer: this notebook is intended to provide an example of how to create an initial training data set for pairing with remote sensing data, and to conduct some exploratory analysis of the vegetation structure data. It does not incorporate outlier detection and removal, or comprehensive pre-processing steps. As part of creating a machine learning model, it is important to assess the training data quality and look for outliers or other potential data quality issues which may impact model results. Refer to the Compare tree height measured from the ground to a Lidar-based Canopy Height Model lesson (the first additional resource below) for more details on how you would address geographic mismatch between the AOP and TOS data.

Additional Resources

  • The lesson Compare tree height measured from the ground to a Lidar-based Canopy Height Model is another example of linking ground to airborne data, and shows similar steps of pre-processing TOS woody vegetation data.

  • The paper Individual canopy tree species maps for the National Ecological Observatory Network outlines methods for large-scale classification using NEON data. The associated NEON Science Seminar Harnessing NEON to enable the future of forest remote sensing may be a useful resource. This talk provides a high-level overview of modeling approaches for tree crown delineation and tree classification using NEON airborne remote sensing data.

  • Refer to the Vegetation Structure User Guide for more details on this data product, and to better understand the data quality flags and the sampling design.

1. Download and Explore Vegetation Structure Data (DP1.10098.001)

In this first section we’ll load the vegetation structure data, find the locations of the mapped trees, and join to the species and family observations.

Let's get started! First, import the required Python packages.

import matplotlib.pyplot as plt
import matplotlib.ticker as mticker
import neonutilities as nu
import numpy as np
import pandas as pd
import requests
import seaborn as sns

Set up your NEON token. See the setup instructions at the beginning of the tutorial on how to set up a NEON user account and create a token, if you have not already done so.

# copy and paste your NEON token from your NEON user account page here
my_token=""

We can load the vegetation structure data using the load_by_product function in the neonutilities package (imported as nu). Inputs to the function can be shown by typing help(nu.load_by_product).

Refer to the R neonUtilities cheat sheet or the Python neonutilities documentation for more details and the complete list of possible function inputs. The cheat sheet is focused on the R package, but nearly all the inputs are the same in Python neonutilities.

Note that in this example, we will pull in all the woody vegetation data (collected over all years), but if you are trying to model data collected in a single year, you can select just that year by specifying the startdate and enddate, or by later filtering the vegetation data by eventID. We have set check_size=False since the data are not very large; to check the size of the data before downloading, you could omit this input or set it to True.

veg_dict = nu.load_by_product(dpid="DP1.10098.001", 
                              site="SERC", 
                              package="basic", 
                              release="RELEASE-2025",
                              token=my_token,
                              check_size=False)
Finding available files
100%|██████████| 23/23 [00:45<00:00,  1.99s/it]
Downloading 23 NEON DP1.10098.001 files totaling approximately 40.0 MB.
Downloading files
100%|██████████| 23/23 [00:24<00:00,  1.08s/it]
Stacking data files
100%|██████████| 4/4 [00:01<00:00,  2.36it/s]

Get a list of the points

veg_map_all = veg_dict["vst_mappingandtagging"]
veg_map = veg_map_all.loc[veg_map_all["pointID"] != ""]
veg_map = veg_map.reindex()
veg_map["points"] = veg_map["namedLocation"] + "." + veg_map["pointID"]
veg_points = list(set(list(veg_map["points"])))

Look at the unique eventIDs. All sampling at a site that occurs within a given bout is identified by a unique eventID, which here encodes the site and the year of the bout.

veg_map_all.eventID.unique()
array(['vst_SERC_2015', 'vst_SERC_2016', 'vst_SERC_2017', 'vst_SERC_2018',
       'vst_SERC_2019', 'vst_SERC_2020', 'vst_SERC_2021', 'vst_SERC_2022',
       'vst_SERC_2023'], dtype=object)

Get the number of records for each eventID:

# Group by 'eventID' and get the count
eventID_counts = veg_map_all[['individualID','eventID']].groupby(['eventID']).count()
print("\nCounts of each eventID:\n", eventID_counts)
Counts of each eventID:
                individualID
eventID                    
vst_SERC_2015          1890
vst_SERC_2016          1330
vst_SERC_2017            96
vst_SERC_2018           127
vst_SERC_2019           254
vst_SERC_2020            22
vst_SERC_2021            54
vst_SERC_2022           494
vst_SERC_2023            40

It looks like most of the trees were mapped in 2015 and 2016, which was when the SERC plots were first established. You could look at data only from one year, and compare to AOP data from the same year, or if you are not too worried about matching measurements to remote sensing data collected in the same year, you could use all years. We'll do the latter in this example.
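If you did want to restrict the analysis to a single bout, filtering on eventID is one option. The sketch below uses a small made-up dataframe in place of veg_map_all so that it runs on its own; the helper name select_event is hypothetical.

```python
import pandas as pd

# Toy stand-in for veg_map_all; the values are illustrative only
toy = pd.DataFrame({
    "individualID": ["A", "B", "C", "D"],
    "eventID": ["vst_SERC_2015", "vst_SERC_2016", "vst_SERC_2016", "vst_SERC_2022"],
})

def select_event(df, event_id):
    """Keep only the rows collected during one sampling bout."""
    return df.loc[df["eventID"] == event_id].copy()

toy_2016 = select_event(toy, "vst_SERC_2016")
print(toy_2016["individualID"].tolist())  # prints ['B', 'C']
```

With the real data, the same call would be select_event(veg_map_all, "vst_SERC_2016"), which could then be paired with AOP data collected in that year.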

2. Determine the geographic location of the surveyed vegetation

Loop through all of the points in veg_points to determine the easting and northing from the NEON Locations API.

easting = []
northing = []
coord_uncertainty = []
elev_uncertainty = []
for i in veg_points:
    vres = requests.get("https://data.neonscience.org/api/v0/locations/"+i)
    vres_json = vres.json()
    easting.append(vres_json["data"]["locationUtmEasting"])
    northing.append(vres_json["data"]["locationUtmNorthing"])
    props = pd.DataFrame.from_dict(vres_json["data"]["locationProperties"])
    cu = props.loc[props["locationPropertyName"]=="Value for Coordinate uncertainty"]["locationPropertyValue"]
    coord_uncertainty.append(cu[cu.index[0]])
    eu = props.loc[props["locationPropertyName"]=="Value for Elevation uncertainty"]["locationPropertyValue"]
    elev_uncertainty.append(eu[eu.index[0]])

pt_dict = dict(points=veg_points, 
               easting=easting,
               northing=northing,
               coordinateUncertainty=coord_uncertainty,
               elevationUncertainty=elev_uncertainty)

pt_df = pd.DataFrame.from_dict(pt_dict)
pt_df.set_index("points", inplace=True)

veg_map = veg_map.join(pt_df, 
                     on="points", 
                     how="inner")
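The loop above assumes that every response contains both uncertainty properties and will raise an error if one is absent. A slightly more defensive variant could separate the parsing step into its own function. This is a sketch only: the function name parse_location is made up, and the example payload holds invented values shaped like the fields used in the loop above, not a real API response.

```python
def parse_location(payload):
    """Extract UTM position and uncertainties from a Locations API response body."""
    data = payload["data"]
    # Map property names to values for easy lookup
    props = {
        p["locationPropertyName"]: p["locationPropertyValue"]
        for p in data.get("locationProperties", [])
    }
    return {
        "easting": data["locationUtmEasting"],
        "northing": data["locationUtmNorthing"],
        # .get() returns None when a property is absent instead of raising
        "coordinateUncertainty": props.get("Value for Coordinate uncertainty"),
        "elevationUncertainty": props.get("Value for Elevation uncertainty"),
    }

# Synthetic payload with made-up values, for illustration only
example = {
    "data": {
        "locationUtmEasting": 364500.0,
        "locationUtmNorthing": 4305500.0,
        "locationProperties": [
            {"locationPropertyName": "Value for Coordinate uncertainty",
             "locationPropertyValue": 0.2},
        ],
    }
}

parsed = parse_location(example)
print(parsed)
```

Inside the loop, parse_location(vres.json()) would replace the individual append calls, and a missing property would show up as None rather than stopping the download.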

Next, use the stemDistance and stemAzimuth data to calculate the precise locations of individuals, relative to the reference locations.

  • $Easting = easting.pointID + stemDistance*sin(\theta)$
  • $Northing = northing.pointID + stemDistance*cos(\theta)$
  • $\theta = stemAzimuth*\pi/180$

Also adjust the coordinate and elevation uncertainties.

veg_map["adjEasting"] = (veg_map["easting"]
                        + veg_map["stemDistance"]
                        * np.sin(veg_map["stemAzimuth"]
                                   * np.pi / 180))

veg_map["adjNorthing"] = (veg_map["northing"]
                        + veg_map["stemDistance"]
                        * np.cos(veg_map["stemAzimuth"]
                                   * np.pi / 180))

veg_map["adjCoordinateUncertainty"] = veg_map["coordinateUncertainty"] + 0.6

veg_map["adjElevationUncertainty"] = veg_map["elevationUncertainty"] + 1
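As a quick sanity check on the formulas, consider a stem 10 m from its reference point at an azimuth of 90 degrees (due east): the easting should increase by 10 m and the northing should stay the same. The sketch below works this through with the standard library; the reference coordinates are made up and the function name stem_position is hypothetical.

```python
import math

def stem_position(point_easting, point_northing, stem_distance, stem_azimuth_deg):
    """Offset a reference point by a distance (m) along an azimuth (degrees from north)."""
    theta = math.radians(stem_azimuth_deg)
    easting = point_easting + stem_distance * math.sin(theta)
    northing = point_northing + stem_distance * math.cos(theta)
    return easting, northing

# Made-up reference point; the stem is 10 m due east of it
e, n = stem_position(364000.0, 4305000.0, 10.0, 90.0)
print(round(e, 3), round(n, 3))  # prints 364010.0 4305000.0
```

Because azimuth is measured clockwise from north, sine gives the east-west offset and cosine gives the north-south offset, which is the reverse of the usual mathematical convention for angles.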

Look at the columns to see all the information contained in this dataset.

# look at a subset of the columns that may be relevant
veg_map[['date','individualID','scientificName','taxonID','family','plotID','pointID','adjEasting','adjNorthing']].head(5)
    date        individualID             scientificName              taxonID  family        plotID    pointID  adjEasting     adjNorthing
1   2015-03-23  NEON.PLA.D02.SERC.08038  Carya glabra (Mill.) Sweet  CAGL8    Juglandaceae  SERC_045  43       364809.083993  4.304727e+06
2   2015-03-23  NEON.PLA.D02.SERC.08036  Liriodendron tulipifera L.  LITU     Magnoliaceae  SERC_045  21       364779.130881  4.304717e+06
6   2015-03-23  NEON.PLA.D02.SERC.05664  Quercus montana Willd.      QUMO4    Fagaceae      SERC_062  59       364857.704931  4.304826e+06
9   2015-03-23  NEON.PLA.D02.SERC.08009  Acer rubrum L.              ACRU     Aceraceae     SERC_045  23       364795.325837  4.304706e+06
10  2015-03-23  NEON.PLA.D02.SERC.08016  Acer rubrum L.              ACRU     Aceraceae     SERC_045  23       364790.181050  4.304714e+06
len(veg_map)
1211

3. Filter to trees within an AOP tile extent

Now create a new dataframe containing only the veg data that are within a single AOP tile (which are 1 km x 1 km in size). For this, you will need to know the bounds (minimum and maximum UTM easting and northing) of the area you are sampling. For this exercise, we will choose the AOP data with SW (lower left) UTM coordinates of 364000, 4305000. This tile encompasses the NEON tower at the SERC site.

veg_tower_tile = veg_map[(veg_map['adjEasting'].between(364000, 365000)) & (veg_map['adjNorthing'].between(4305000, 4306000))]
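If you do not already know which tile a tree falls in, the tile's SW corner can be computed from the coordinates. The sketch below assumes that tile corners fall on multiples of 1000 m in UTM, which matches the SW coordinates used here; the function names are made up for this example. Note that pandas between() includes both endpoints by default, so a point lying exactly on a shared edge would match two neighboring tiles, whereas the in_tile check below uses a half-open interval.

```python
import math

TILE_SIZE = 1000  # AOP tiles are 1 km x 1 km

def tile_origin(easting, northing, tile_size=TILE_SIZE):
    """Return the SW (lower left) corner of the tile containing a point."""
    return (math.floor(easting / tile_size) * tile_size,
            math.floor(northing / tile_size) * tile_size)

def in_tile(easting, northing, sw_easting, sw_northing, tile_size=TILE_SIZE):
    """True if the point lies in [sw, sw + tile_size) in both directions."""
    return (sw_easting <= easting < sw_easting + tile_size
            and sw_northing <= northing < sw_northing + tile_size)

# Coordinates rounded from one of the mapped trees shown in this tutorial
print(tile_origin(364578.23, 4305896.0))  # prints (364000, 4305000)
print(in_tile(364578.23, 4305896.0, 364000, 4305000))  # prints True
```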

How many records do we have within this tile?

len(veg_tower_tile)
211

There are 211 unique vegetation records in this area. We can also look at the unique taxonIDs that are represented.

# look at the unique Taxon IDs
veg_tower_tile.taxonID.unique()
array(['LIST2', 'ACRU', 'FAGR', 'ULMUS', 'LITU', 'CACA18', 'NYSY', 'QUFA',
       'QURU', 'QUAL', 'CAGL8', 'CATO6', 'QUVE', 'COFL2', 'QUERC',
       'QUMA3', 'PRAV', 'PINUS'], dtype=object)

Let's keep only a subset of the columns that we are interested in, and look at the dataframe:

veg_tower_tile_short = veg_tower_tile[['date','individualID','scientificName','taxonID','family','plotID','pointID','adjEasting','adjNorthing']]
veg_tower_tile_short.reset_index(drop=True, inplace=True)
veg_tower_tile_short
     date        individualID             scientificName                taxonID  family          plotID    pointID  adjEasting     adjNorthing
0    2015-03-25  NEON.PLA.D02.SERC.05537  Liquidambar styraciflua L.    LIST2    Hamamelidaceae  SERC_052  59       364578.230724  4.305896e+06
1    2015-03-25  NEON.PLA.D02.SERC.05546  Acer rubrum L.                ACRU     Aceraceae       SERC_052  61       364590.549216  4.305899e+06
2    2015-03-25  NEON.PLA.D02.SERC.05548  Fagus grandifolia Ehrh.       FAGR     Fagaceae        SERC_052  43       364586.368495  4.305883e+06
3    2015-03-25  NEON.PLA.D02.SERC.05549  Acer rubrum L.                ACRU     Aceraceae       SERC_052  59       364578.883395  4.305888e+06
4    2015-03-25  NEON.PLA.D02.SERC.05545  Acer rubrum L.                ACRU     Aceraceae       SERC_052  61       364590.276108  4.305897e+06
...  ...         ...                      ...                           ...      ...             ...       ...      ...            ...
206  2023-01-10  NEON.PLA.D02.SERC.09712  Carya tomentosa (Lam.) Nutt.  CATO6    Juglandaceae    SERC_056  43       364708.570635  4.305389e+06
207  2023-01-11  NEON.PLA.D02.SERC.09708  Pinus sp.                     PINUS    Pinaceae        SERC_056  59       364701.345113  4.305407e+06
208  2023-01-16  NEON.PLA.D02.SERC.09358  Pinus sp.                     PINUS    Pinaceae        SERC_057  41       364438.131683  4.305409e+06
209  2023-01-16  NEON.PLA.D02.SERC.14563  Quercus alba L.               QUAL     Fagaceae        SERC_057  43       364470.101669  4.305412e+06
210  2023-01-16  NEON.PLA.D02.SERC.14548  Fagus grandifolia Ehrh.       FAGR     Fagaceae        SERC_057  41       364455.772024  4.305415e+06

211 rows × 9 columns

To get a better sense of the data, we can also look at the number of records for each species, to see if some species have more representation than others.

# display the taxonID counts, sorted descending
veg_tower_tile_taxon_counts = veg_tower_tile[['individualID','taxonID']].groupby(['taxonID']).count()
veg_tower_tile_taxon_counts.sort_values(by='individualID',ascending=False)
taxonID  individualID
FAGR     48
LITU     35
LIST2    29
ACRU     16
CAGL8    12
QUAL     11
CATO6    11
CACA18   11
NYSY     10
QUFA     9
ULMUS    5
QURU     4
COFL2    3
QUVE     2
PINUS    2
PRAV     1
QUERC    1
QUMA3    1
# display the family counts, sorted descending
veg_tower_tile_family_counts = veg_tower_tile[['individualID','family']].groupby(['family']).count()
veg_tower_tile_family_counts.sort_values(by='individualID',ascending=False)
family          individualID
Fagaceae        76
Magnoliaceae    35
Hamamelidaceae  29
Juglandaceae    23
Aceraceae       16
Cornaceae       13
Betulaceae      11
Ulmaceae        5
Pinaceae        2
Rosaceae        1

It looks like there are a number of different species (and families) mapped in this tile. You can use the https://plants.usda.gov website to look up the species information. The 5 most abundant mapped species are listed below.

  • FAGR: American Beech (Fagus grandifolia Ehrh.)
  • LITU: Tuliptree (Liriodendron tulipifera L.)
  • LIST2: Sweetgum (Liquidambar styraciflua L.)
  • ACRU: Red Maple (Acer rubrum L.)
  • CAGL8: Pignut hickory (Carya glabra (Mill.) Sweet)

When carrying out classification, the species that only have small representation (1-5 samples) may not be modeled accurately due to a lack of sufficient training data. The challenge of mapping rarer species due to insufficient training data is well known. In the next tutorial, we will remove these poorly represented samples before generating a model.
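One possible way to drop poorly represented taxa is to count the records per taxonID and keep only those at or above a minimum. This is a sketch of the general idea, not the method used in the next tutorial; the threshold MIN_SAMPLES, the helper drop_rare_taxa, and the toy data are all assumptions made for illustration.

```python
import pandas as pd

MIN_SAMPLES = 3  # hypothetical threshold; choose a value suited to your model

# Toy stand-in for veg_tower_tile_short; the values are illustrative only
toy = pd.DataFrame({
    "individualID": ["t1", "t2", "t3", "t4", "t5"],
    "taxonID": ["FAGR", "FAGR", "FAGR", "PRAV", "LITU"],
})

def drop_rare_taxa(df, min_samples=MIN_SAMPLES, column="taxonID"):
    """Keep only rows whose taxon appears at least min_samples times."""
    # Look up each row's taxon in the per-taxon record counts
    counts = df[column].map(df[column].value_counts())
    return df.loc[counts >= min_samples].reset_index(drop=True)

kept = drop_rare_taxa(toy)
print(kept["taxonID"].unique().tolist())  # prints ['FAGR']
```

The same function could be applied with column="family" if the model is trained on families rather than species.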

4. Write training dataframe to csv file

Nonetheless, we have a fairly decent training dataset to work with. We can save the dataframe to a csv file called serc_training_data.csv as follows:

# forward slashes work on Windows, macOS, and Linux; the data folder must already exist
veg_tower_tile_short.to_csv('./data/serc_training_data.csv', index=False)

5. Plot tree families in map view

Finally, we can make a quick plot using seaborn (imported as sns) to show the spatial distribution of the trees surveyed in this area, colored by family. Most of this code helps improve the formatting and appearance of the figure; the first sns.scatterplot chunk is all you really need to plot the essentials.

ax = sns.scatterplot(
    data=veg_tower_tile_short,
    x='adjEasting',
    y='adjNorthing',
    hue='family',
)

# Make the x and y dimensions equal
ax.set_aspect('equal', adjustable='box')

# Remove scientific notation on the x and y axes labels
ax.xaxis.set_major_formatter(mticker.FuncFormatter(lambda x, _: f'{x:.0f}'))
ax.yaxis.set_major_formatter(mticker.FuncFormatter(lambda y, _: f'{y:.0f}'))

# Place the legend outside the plot at the center right
plt.legend(loc='center left', bbox_to_anchor=(1.05, 0.5))

# Adjust layout to prevent legend overlap
plt.tight_layout()

# Add title and axis labels
ax.set_title("SERC Tree Families", fontsize=14)
ax.set_xlabel("Easting (m)", fontsize=12)
ax.set_ylabel("Northing (m)", fontsize=12)
plt.yticks(fontsize=8)  
plt.xticks(fontsize=8)  

plt.show()

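[Figure: scatter plot titled "SERC Tree Families", with Easting (m) on the x-axis, Northing (m) on the y-axis, and points colored by family.]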

Great! We can see all the trees that were surveyed in this AOP tile. The trees are sampled in discrete plots. For more information about the TOS sampling design, please refer to the Vegetation structure data product page.

Recap

In this lesson, we have created a training data set containing information about the tree family and species as well as their geographic locations in UTM x, y coordinates. We can now pair this training data set with the remote sensing data and create a model to predict a tree's family based on airborne spectral data. The next tutorial, Tree Classification with NEON Airborne Imaging Spectrometer Data using Python xarray, will show how to do this!

Note: you may wish to create a training dataframe that contains additional information about the trees. For example, you can also include parameters like the growth form (e.g. whether the vegetation is a shrub, single-bole or multi-bole tree, etc.), the plant status (whether the tree is healthy or standing dead), and measurements such as the stem diameter and tree height. To do this, you would need to join the vst_mappingandtagging table with the vst_apparentindividual table. Refer to the Quick Start Guide for Vegetation Structure for more information about the data tables and the joining instructions. You can also refer to the lesson Compare tree height measured from the ground to a Lidar-based Canopy Height Model, which provides an example of how to do this and compares the TOS measured data with the AOP Lidar-derived Canopy Height Model (Ecosystem Structure) data product.
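The join described in the note might look roughly like the sketch below. It uses two small made-up dataframes in place of the real tables, and it assumes that individualID is the join key and that the measurement table has date, height, and plantStatus columns; check the Quick Start Guide for the actual joining instructions before relying on this.

```python
import pandas as pd

# Toy stand-ins for vst_mappingandtagging and vst_apparentindividual
mapping = pd.DataFrame({
    "individualID": ["t1", "t2"],
    "taxonID": ["FAGR", "LITU"],
})
apparent = pd.DataFrame({
    "individualID": ["t1", "t1", "t2"],
    "date": ["2015-03-25", "2022-01-10", "2016-02-01"],
    "height": [18.2, 19.0, 25.4],
    "plantStatus": ["Live", "Live", "Live"],
})

# A tree can be measured in several bouts; keep its most recent record.
# ISO-formatted date strings sort chronologically.
latest = (apparent.sort_values("date")
                  .drop_duplicates("individualID", keep="last"))

# Left join so every mapped tree is kept, even without a measurement
joined = mapping.merge(latest, on="individualID", how="left")
print(joined[["individualID", "taxonID", "height"]])
```

Keeping only the latest measurement is one design choice among several; to match a specific AOP flight year, you might instead select the measurement from the bout closest to that year.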

Get Lesson Code

make_classification_training_data.ipynb

Copyright © Battelle, 2025

The National Ecological Observatory Network is a major facility fully funded by the U.S. National Science Foundation.

Any opinions, findings and conclusions or recommendations expressed in this material do not necessarily reflect the views of the U.S. National Science Foundation.