Tutorial
Exploring sample availability at the NEON Biorepository
Authors: [Kelsey Yule]
Last Updated: May 22, 2025
Learning Objectives
In this tutorial, we will learn how to develop a sample list to optimally answer a research question based on NEON data product and NEON Biorepository sample availability.
- Outline a broad research question.
- Download related data from the main NEON data portal using the neonUtilities R package.
- Compare NEON data availability across multiple data products in order to narrow research scope.
- Identify NEON Biorepository samples that match the research scope of interest using NEON Biorepository data.
- Visualize our request in an interactive map.
Research question
We are interested in testing for relationships between the diet of small mammals and the carbon and nitrogen stable isotope ratios in co-occurring plant communities in a portion of the eastern United States. NEON provides extensive information about stable isotopes from samples collected from the canopy and in litter traps. While NEON does not measure small mammal diets as part of its focal protocols, NEON archives both hair and fecal samples that researchers can use to gain these insights.
What samples from the NEON Biorepository can be requested in order to conduct this study?
To answer this question, we need to understand where there is spatial and temporal overlap in measurements of canopy foliar chemistry, litter chemistry, and mammal sampling.
We will attempt to develop a list of samples following these criteria:
- Site and year combinations for which carbon nitrogen ratio measurements are available for both canopy foliage and litter
- Common species for which we can achieve a minimum viable sample size for our study
- Both a hair and a fecal sample were collected from the same individual
Background information on the NEON Biorepository
The NEON Biorepository is located at Arizona State University and serves as the primary repository for the 100,000 NEON samples and specimens collected across all 81 NEON field sites each year.
The NEON Biorepository data portal allows users to search and download records associated with NEON samples and specimens, request samples and specimens for research, and publish sample-associated data. The NEON Biorepository data portal is built on open-source Symbiota software. Symbiota is the most frequently used software in North America for managing natural history collections records.
Symbiota portals publish sample, specimen, and observation data following the Darwin Core Standard developed by Biodiversity Information Standards (TDWG). This data standard is a stable, straightforward, and flexible framework for compiling biodiversity data from varied sources.
Getting Started
If you have not previously installed the required packages, use the install.packages() function to do so.
install.packages('tidyverse')
install.packages('neonUtilities')
install.packages('curl')
install.packages('leaflet')
install.packages('leaflet.minicharts')
install.packages('lubridate')
install.packages('ggplot2')
Once installed, load the packages.
library(tidyverse)
library(neonUtilities)
library(curl)
library(leaflet)
library(leaflet.minicharts)
library(lubridate)
library(ggplot2)
Obtain relevant NEON Terrestrial Observation System data
In order to answer our question, we need to know which NEON sites and years correspond to available carbon and nitrogen stable isotope data for canopy foliage and litter samples. We will use the loadByProduct() function in the neonUtilities package to download all of the data from these two data products from the main NEON data portal. This requires that we know the NEON data product IDs for each relevant data product.
NEON Plant Foliar Traits: DP1.10026.001
NEON Litterfall and fine woody debris production and chemistry: DP1.10033.001
Because we are interested in a portion of the eastern United States, we will subset the available data to sites in NEON Domains 1, 2, and 7.
This download may take a few minutes. While not required, it is recommended that you use a NEON API token to achieve faster download speeds.
Note that we have chosen to include Provisional data in the foliar traits download (include.provisional=TRUE) for this exploratory analysis of sample and data availability. If you are interested in ensuring repeatability of analysis results, you should limit your download to data in a Release.
NEON.cfc <- loadByProduct(dpID="DP1.10026.001",
                          include.provisional=TRUE,
                          site=c('BLAN','BART','GRSM','HARV',
                                 'MLBS','ORNL','SCBI','SERC'),
                          check.size=FALSE)
NEON.ltr <- loadByProduct(dpID="DP1.10033.001",
                          site=c('BLAN','BART','GRSM','HARV',
                                 'MLBS','ORNL','SCBI','SERC'),
                          check.size=FALSE)
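The API token recommended above can be passed to loadByProduct() through its token argument. The following is a minimal sketch, assuming the token has been stored in an environment variable. The variable name NEON_TOKEN and the helper get_neon_token() are our own illustrative choices and are not part of neonUtilities.

```r
# Hypothetical helper: read a NEON API token from an environment variable.
# The name "NEON_TOKEN" is an assumption, not a neonUtilities convention.
get_neon_token <- function(var = "NEON_TOKEN") {
  tok <- Sys.getenv(var, unset = "")
  # Return NA (the "no token" value) when the variable is unset or empty
  if (nzchar(tok)) tok else NA_character_
}

token <- get_neon_token()

# The download call would then look like this. It is commented out so that
# the sketch runs without network access:
# NEON.cfc <- loadByProduct(dpID = "DP1.10026.001",
#                           site = c("BLAN", "BART"),
#                           include.provisional = TRUE,
#                           check.size = FALSE,
#                           token = token)
```

Keeping the token outside the script avoids sharing it by accident when the analysis code is published.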
Let's take a look at what is included in each NEON data product download. Each download is a named list of tables; we will later extract the tables that contain the carbon and nitrogen measurements.
# What's in a download?
names(NEON.cfc)
## [1] "categoricalCodes_10026" "cfc_carbonNitrogen" "cfc_chemistrySubsampling"
## [4] "cfc_chlorophyll" "cfc_elements" "cfc_fieldData"
## [7] "cfc_lignin" "cfc_LMA" "cfc_shapefile"
## [10] "citation_10026_PROVISIONAL" "citation_10026_RELEASE-2025" "issueLog_10026"
## [13] "readme_10026" "validation_10026" "variables_10026"
names(NEON.ltr)
## [1] "categoricalCodes_10033" "citation_10033_RELEASE-2025" "issueLog_10033"
## [4] "ltr_chemistrySubsampling" "ltr_fielddata" "ltr_litterCarbonNitrogen"
## [7] "ltr_litterLignin" "ltr_massdata" "ltr_pertrap"
## [10] "readme_10033" "validation_10033" "variables_10033"
We can see that the data downloads for each product include several tables. The Quick Start Guides on any NEON data product description page are especially useful for understanding these tables, as are the "variables" and "readme" files included in data downloads. It is recommended that anyone who plans to use NEON data in their work carefully review the associated reading materials.
Narrow the spatial and temporal scope based on available data
For our purpose, we are interested in the files containing measurements of carbon and nitrogen, so we will extract those data tables.
cfc <- NEON.cfc$cfc_carbonNitrogen
ltr <- NEON.ltr$ltr_litterCarbonNitrogen
We will summarize the available data to find the year by site combinations for which both foliar and litter chemistry measurements are available.
summary.cfc <- cfc %>%
mutate(year=year(collectDate)) %>%
group_by(siteID,year) %>%
summarise(n=length(uid),meanCN=mean(CNratio,na.rm=TRUE))
summary.cfc
## # A tibble: 14 × 4
## # Groups: siteID [8]
## siteID year n meanCN
## <chr> <dbl> <int> <dbl>
## 1 BART 2022 44 36.1
## 2 BLAN 2020 48 24.0
## 3 GRSM 2016 45 25.6
## 4 GRSM 2021 55 26.1
## 5 HARV 2018 45 33.5
## 6 HARV 2024 60 36.0
## 7 MLBS 2018 45 23.2
## 8 MLBS 2023 46 22.9
## 9 ORNL 2017 42 29.6
## 10 ORNL 2022 58 27.5
## 11 SCBI 2017 44 21.3
## 12 SCBI 2022 46 14.8
## 13 SERC 2016 36 26.6
## 14 SERC 2021 58 29.9
summary.ltr <- ltr %>%
mutate(year=year(collectDate)) %>%
group_by(siteID,year) %>%
summarise(n=length(uid),meanCN=mean(CNratio,na.rm=TRUE))
summary.ltr
## # A tibble: 12 × 4
## # Groups: siteID [8]
## siteID year n meanCN
## <chr> <dbl> <int> <dbl>
## 1 BART 2016 37 98.6
## 2 BART 2022 27 77.2
## 3 BLAN 2020 15 39.5
## 4 GRSM 2021 14 71.7
## 5 HARV 2018 58 64.8
## 6 MLBS 2018 19 42.3
## 7 MLBS 2023 31 62.6
## 8 ORNL 2017 25 62.9
## 9 ORNL 2022 20 72.5
## 10 SCBI 2017 13 59.5
## 11 SCBI 2022 17 47.0
## 12 SERC 2021 12 72.7
We can see that there are more year by site combinations for which CN ratio data exist from canopy foliage than from litter samples. Since we are interested in studying both components of the ecosystem, let's subset our data to only those instances for which both sets of data are available. To do this we need to join our datasets by site and year.
CN <- full_join(summary.cfc,summary.ltr,
join_by("siteID"=="siteID","year"=="year"),
suffix = c(".cfc",".ltr")) %>%
filter(meanCN.cfc>0,meanCN.ltr>0)
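The full join followed by a filter on both mean columns keeps only the site by year rows that appear in both summaries, which is what an inner join does. A small sketch on made-up values (not NEON data), written in base R so that it runs without additional packages:

```r
# Toy site-by-year summaries (values are illustrative, not NEON data)
toy.cfc <- data.frame(siteID = c("AAAA", "AAAA", "BBBB"),
                      year   = c(2017, 2022, 2021),
                      meanCN = c(25, 27, 30))
toy.ltr <- data.frame(siteID = c("AAAA", "BBBB", "BBBB"),
                      year   = c(2022, 2016, 2021),
                      meanCN = c(60, 90, 70))

# merge() with the default all = FALSE is an inner join:
# only site-year pairs present in both tables survive
toy.CN <- merge(toy.cfc, toy.ltr, by = c("siteID", "year"),
                suffixes = c(".cfc", ".ltr"))
toy.CN
```

Only AAAA 2022 and BBBB 2021 remain, because those are the only combinations measured in both toy tables.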
We will then select the most recent year of data for each site and add a column for a new "site by year" variable.
CN <- CN %>%
filter(!duplicated(siteID,fromLast = TRUE)) %>%
mutate(siteYear=paste(siteID,year,sep="."))
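Selecting the most recent year with duplicated(..., fromLast = TRUE) relies on the rows being ordered by year within each site. The sketch below, on made-up values, makes that ordering explicit before applying the same idea in base R.

```r
# Toy site-by-year table (illustrative values only)
toy.CN <- data.frame(siteID = c("AAAA", "BBBB", "AAAA"),
                     year   = c(2022, 2021, 2017))

# Sort so that the latest year is the last row for each site
toy.CN <- toy.CN[order(toy.CN$siteID, toy.CN$year), ]

# duplicated(..., fromLast = TRUE) flags every row except the last per site
latest <- toy.CN[!duplicated(toy.CN$siteID, fromLast = TRUE), ]
latest$siteYear <- paste(latest$siteID, latest$year, sep = ".")
latest
```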
We have identified 8 site by year combinations for which we would like to obtain paired mammal hair and fecal samples for further study! Now we will look for available mammal samples.
Load and explore NEON Biorepository data
Here, we read in a CSV file of occurrence records downloaded from the NEON Biorepository data portal. The results are located in the GitHub repository associated with this tutorial. This represents all small mammal hair and fecal samples from Domains 1, 2, and 7 archived at the NEON Biorepository.
Up-to-date results for the same search terms can be found at any time here.
biorepo<-read.csv(curl("https://github.com/kyule/neon-biorepo-tutorial/raw/main/biorepoOccurrences_FecalAndHairSamples_20250102.csv"))
Let's look at what information is included in a Darwin Core occurrence record. What variables exist in the records?
names(biorepo)
## [1] "id" "institutionCode" "collectionCode"
## [4] "ownerInstitutionCode" "basisOfRecord" "occurrenceID"
## [7] "catalogNumber" "otherCatalogNumbers" "higherClassification"
## [10] "kingdom" "phylum" "class"
## [13] "order" "family" "scientificName"
## [16] "taxonID" "scientificNameAuthorship" "genus"
## [19] "subgenus" "specificEpithet" "verbatimTaxonRank"
## [22] "infraspecificEpithet" "taxonRank" "identifiedBy"
## [25] "dateIdentified" "identificationReferences" "identificationRemarks"
## [28] "taxonRemarks" "identificationQualifier" "typeStatus"
## [31] "recordedBy" "associatedCollectors" "recordNumber"
## [34] "eventDate" "eventDate2" "year"
## [37] "month" "day" "startDayOfYear"
## [40] "endDayOfYear" "verbatimEventDate" "occurrenceRemarks"
## [43] "habitat" "substrate" "verbatimAttributes"
## [46] "fieldNumber" "eventID" "informationWithheld"
## [49] "dataGeneralizations" "dynamicProperties" "associatedOccurrences"
## [52] "associatedSequences" "associatedTaxa" "reproductiveCondition"
## [55] "establishmentMeans" "cultivationStatus" "lifeStage"
## [58] "sex" "individualCount" "samplingProtocol"
## [61] "preparations" "locationID" "continent"
## [64] "waterBody" "islandGroup" "island"
## [67] "country" "stateProvince" "county"
## [70] "municipality" "locality" "locationRemarks"
## [73] "localitySecurity" "localitySecurityReason" "decimalLatitude"
## [76] "decimalLongitude" "geodeticDatum" "coordinateUncertaintyInMeters"
## [79] "verbatimCoordinates" "georeferencedBy" "georeferenceProtocol"
## [82] "georeferenceSources" "georeferenceVerificationStatus" "georeferenceRemarks"
## [85] "minimumElevationInMeters" "maximumElevationInMeters" "minimumDepthInMeters"
## [88] "maximumDepthInMeters" "verbatimDepth" "verbatimElevation"
## [91] "disposition" "language" "recordEnteredBy"
## [94] "modified" "sourcePrimaryKey.dbpk" "collID"
## [97] "recordID" "references"
We see that a large number of Darwin Core fields are present in the results that outline the who, what, where, when, and more of each sample. For fun, let's explore the data. Try grouping or summarizing by any fields that interest you.
# How many samples are included in the results for each collection, species, and sex?
biorepo %>%
group_by(collectionCode,scientificName,sex) %>%
count()
## # A tibble: 83 × 4
## # Groups: collectionCode, scientificName, sex [83]
## collectionCode scientificName sex n
## <chr> <chr> <chr> <int>
## 1 MAMC-FE "" Male 1
## 2 MAMC-FE "Clethrionomys gapperi" Female 15
## 3 MAMC-FE "Clethrionomys gapperi" Male 7
## 4 MAMC-FE "Clethrionomys gapperi" Unknown 2
## 5 MAMC-FE "Microtus pennsylvanicus" Female 33
## 6 MAMC-FE "Microtus pennsylvanicus" Male 34
## 7 MAMC-FE "Microtus pinetorum" Female 8
## 8 MAMC-FE "Microtus pinetorum" Male 6
## 9 MAMC-FE "Mus musculus" Female 38
## 10 MAMC-FE "Mus musculus" Male 46
## # ℹ 73 more rows
An aside on taxonomic identifications: We see several different taxa represented within the results. Most samples are associated with a species-level determination. However, some small mammal species are very difficult to identify while live in the field at some sites. Therefore, some individuals are identified only to genus and others are given a "/" taxon. The latter represent uncertain identification between two species that are difficult to distinguish at a given field site. For example, individuals identified as Peromyscus leucopus/maniculatus could not be confidently determined to be either P. maniculatus or P. leucopus in the field. Below, we will narrow our data to only samples for which a species-level identification was made. Many specimens are also given identification qualifiers, such as "cf. species," which indicates some uncertainty in the field determination. We will ignore those notes for today, but we encourage any researcher interested in NEON small mammal samples to contact the NEON Biorepository staff (biorepo@asu.edu) to better understand species-level identification confidence. When possible, we are often interested in collaborating with researchers on efforts to confirm identifications.
biorepo <- biorepo %>%
filter(!grepl("/",scientificName),!is.na(specificEpithet),specificEpithet!="")
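Depending on how the CSV is read, a missing specificEpithet may appear either as NA or as an empty string (read.csv() keeps empty character fields as "" by default). The sketch below uses made-up records to show a species-level filter that handles both cases; the column names follow the Darwin Core fields listed above.

```r
# Made-up records covering the cases discussed in the aside
toy <- data.frame(
  scientificName  = c("Peromyscus leucopus",
                      "Peromyscus leucopus/maniculatus",
                      "Peromyscus",
                      ""),
  specificEpithet = c("leucopus", "", "", NA),
  stringsAsFactors = FALSE
)

# "/" taxa mark uncertainty between two species
is_slash <- grepl("/", toy$scientificName, fixed = TRUE)

# A usable epithet is neither NA nor an empty string
has_epithet <- !is.na(toy$specificEpithet) & nzchar(toy$specificEpithet)

kept <- toy[!is_slash & has_epithet, ]
kept
```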
Narrow the results to a set of samples that fits our research question
We want to include only samples collected from the same site by year combinations we are interested in based on CN ratio data, so we create a site by year column and filter the results.
biorepo <- biorepo %>%
mutate(siteID=substr(locationID,1,4)) %>%
mutate(siteYear=paste(siteID,year,sep=".")) %>%
filter(siteYear %in% CN$siteYear)
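The site by year filter can be sketched on a few made-up records. The locationID strings below are illustrative only; the one property the code relies on, as in the tutorial, is that the first four characters are the NEON site code.

```r
# Made-up occurrence records (locationID values are illustrative)
toy <- data.frame(
  locationID = c("HARV_001.mammalGrid.mam",
                 "SCBI_003.mammalGrid.mam",
                 "HARV_006.mammalGrid.mam"),
  year = c(2018, 2022, 2020),
  stringsAsFactors = FALSE
)

# Site code = first four characters of locationID
toy$siteID   <- substr(toy$locationID, 1, 4)
toy$siteYear <- paste(toy$siteID, toy$year, sep = ".")

# Keep only the site-year combinations of interest
wanted <- c("HARV.2018", "SCBI.2022")
kept <- toy[toy$siteYear %in% wanted, ]
kept
```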
Different collectionCodes correspond to different sample types.
"MAMC-HA" corresponds to the Mammal Hair Sample Collection.
"MAMC-FE" corresponds to the Mammal Fecal Sample Collection.
We will separate the hair and fecal samples into separate data frames for ease of use.
# Extract the hair and fecal samples
hair <- biorepo %>%
filter(collectionCode=="MAMC-HA")
fecal <- biorepo %>%
filter(collectionCode=="MAMC-FE")
We see that there are a large number of samples that fit the site and year criteria of our study. We know that we want to focus on common species because we cannot fully deplete NEON Biorepository resources for future researchers (Sample Use Policy), and we want to make sure we have sufficient within species replication for our analyses (and likely have our own resource constraints!). Therefore, we can subset to species with the most available hair samples.
First, we will find how common different species are, select the 2 most common species for each site, and add a site by species column.
hairBySpecies <- hair %>%
group_by(siteID,scientificName) %>%
count() %>%
arrange(desc(n)) %>%
group_by(siteID) %>%
slice(1:2) %>%
mutate(siteSp=paste(siteID,scientificName,sep="_"))
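The "two most common species per site" step can be illustrated in base R on a made-up table. The species labels sp1 to sp3 are placeholders, and the counts are not NEON data.

```r
# Made-up hair sample records
toy <- data.frame(
  siteID = c(rep("AAAA", 6), rep("BBBB", 4)),
  scientificName = c("sp1", "sp1", "sp1", "sp2", "sp2", "sp3",
                     "sp2", "sp2", "sp2", "sp1"),
  stringsAsFactors = FALSE
)

# Count samples per site and species, dropping combinations with no samples
counts <- as.data.frame(table(siteID = toy$siteID,
                              scientificName = toy$scientificName),
                        responseName = "n", stringsAsFactors = FALSE)
counts <- counts[counts$n > 0, ]

# Order by site, then by decreasing count, and rank within each site
counts <- counts[order(counts$siteID, -counts$n), ]
rank_in_site <- ave(counts$n, counts$siteID, FUN = seq_along)

# Keep the two most common species at each site
top2 <- counts[rank_in_site <= 2, ]
top2$siteSp <- paste(top2$siteID, top2$scientificName, sep = "_")
top2
```

If two species tie for second place, this sketch keeps whichever comes first after sorting, so a real analysis may want an explicit tie-breaking rule.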
Now we will filter the hair and fecal samples by these site by species combinations.
hair <- hair %>%
mutate(siteSp=paste(siteID,scientificName,sep="_")) %>%
filter(siteSp %in% hairBySpecies$siteSp)
fecal <- fecal %>%
mutate(siteSp=paste(siteID,scientificName,sep="_")) %>%
filter(siteSp %in% hairBySpecies$siteSp)
We now need to determine which fecal and hair samples are associated with the same individual. We can determine this based on the "associatedOccurrences" field in the NEON Biorepository occurrence records. This field provides URL links to samples that can be related in a variety of ways based on Darwin Core controlled terminology.
These URLs are pipe delimited and contain the "catalogNumber" for related samples. The only relationship between mammal hair and fecal samples in the NEON Biorepository data portal is "derivedFromSameIndividual." We will extract relationships from the "associatedOccurrences" field to create a new data frame of catalogNumbers of paired samples. The code provided below is a for loop that cycles through the fecal samples and, for each one, looks for a hair sample that references it.
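# Start with an empty data frame of paired catalogNumbers
sampleMatches <- data.frame(hair=character(),fecal=character())
# seq_len() skips the loop entirely if there are no fecal samples
for (i in seq_len(nrow(fecal))){
  # First hair sample whose associatedOccurrences mentions this fecal catalogNumber
  matchHair <- hair$catalogNumber[grepl(fecal$catalogNumber[i],hair$associatedOccurrences,fixed=TRUE)][1]
  if(!is.na(matchHair)){
    sampleMatches <- rbind(sampleMatches,data.frame(hair=matchHair,fecal=fecal$catalogNumber[i]))
  }
}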
We will remove duplicate samples from this list.
sampleMatches <- sampleMatches %>% filter(!duplicated(hair))
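The matching and de-duplication steps can be tried on a few made-up records. The catalog numbers and the URL pattern in associatedOccurrences below are placeholders, not the portal's real format; the sketch only assumes, as described above, that each pipe-delimited entry contains the catalogNumber of the related sample.

```r
# Made-up records; URLs are placeholders, not real portal links
hair <- data.frame(
  catalogNumber = c("H001", "H002", "H003"),
  associatedOccurrences = c(
    paste("derivedFromSameIndividual: https://example.org/occ?catalog=F001",
          "derivedFromSameIndividual: https://example.org/occ?catalog=X900",
          sep = " | "),
    "",
    "derivedFromSameIndividual: https://example.org/occ?catalog=F003"),
  stringsAsFactors = FALSE
)
fecal <- data.frame(catalogNumber = c("F001", "F002", "F003"),
                    stringsAsFactors = FALSE)

# For one fecal sample, return the first hair sample that references it
match_one <- function(fecal_id) {
  hit <- which(grepl(fecal_id, hair$associatedOccurrences, fixed = TRUE))
  if (length(hit) == 0) NA_character_ else hair$catalogNumber[hit[1]]
}

matched <- vapply(fecal$catalogNumber, match_one, character(1),
                  USE.NAMES = FALSE)
sampleMatches <- data.frame(hair = matched, fecal = fecal$catalogNumber,
                            stringsAsFactors = FALSE)

# Drop unmatched fecal samples, then keep one pair per hair sample
sampleMatches <- sampleMatches[!is.na(sampleMatches$hair), ]
sampleMatches <- sampleMatches[!duplicated(sampleMatches$hair), ]
sampleMatches
```

Because this is substring matching, a catalog number that is a prefix of another (for example "F001" and "F0010") could produce a false pair, so it is worth spot-checking a few matches against the portal.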
We find more than 350 unique individuals with paired hair and fecal samples meeting our criteria so far. Let's summarize by the number of samples across site by species combinations and filter to those combinations for which 10 or more individuals are available.
First, we grab the rest of the data associated with the hair samples
# Grab the rest of the data associated with the hair samples
hairMatches <- sampleMatches %>%
left_join(hair,join_by("hair"=="catalogNumber"))
Then, we filter to the combinations for which we can obtain 10 or more paired samples and subset the matching samples
hairMatchSummary <- hairMatches %>%
group_by(siteSp) %>%
count() %>%
filter(n>=10)
hairMatches <- hairMatches %>%
filter(siteSp %in% hairMatchSummary$siteSp)
Finalize a sample list
To finalize the list, we randomly select a sample size of 10 for each species and site combination.
set.seed(85705)
hairMatchSet <- hairMatches[sample(nrow(hairMatches)),] %>%
arrange(desc(siteSp)) %>% group_by(siteSp) %>%
slice(1:10)
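Random selection of a fixed number of samples per group can also be sketched in base R. The data and the seed below are made up for illustration, and pick() is a hypothetical helper.

```r
set.seed(1)  # arbitrary seed so the draw is repeatable

# Made-up paired samples: 15 at one site-species combination, 12 at another
toy <- data.frame(
  siteSp = c(rep("AAAA_sp1", 15), rep("BBBB_sp2", 12)),
  hair   = sprintf("H%03d", 1:27),
  stringsAsFactors = FALSE
)

# Hypothetical helper: draw up to `size` rows at random from one group
pick <- function(df, size = 10) {
  df[sample(nrow(df), min(size, nrow(df))), ]
}

# Apply the draw within each site-species group and recombine
picked <- do.call(rbind, lapply(split(toy, toy$siteSp), pick))
table(picked$siteSp)
```

Setting the seed matters here: without it, rerunning the script would produce a different sample list each time.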
We now want to create a data frame representing the full list of samples we would like to request for our project.
# Filter the full data sets to those involved in the request of interest
request <- biorepo %>%
filter(catalogNumber %in% c(hairMatchSet$hair,hairMatchSet$fecal))
We now have a list of 140 samples we could request from the NEON Biorepository via the Sample Request Form.
In what other ways might we have manipulated or subset the data for our question?
Visualize our request
We might be interested in creating a visualization for a grant proposal in which we planned to use these samples.
To map our samples, let's first create a data frame with the average geographic location across the selected samples and the CN ratio data.
mapSummary <- hairMatchSet %>%
group_by(siteID) %>%
summarise(lat=mean(decimalLatitude),lng=mean(decimalLongitude)) %>%
left_join(CN,join_by("siteID"=="siteID"))
Next, let's add the species to this data frame.
mapSummaryWithSpecies <- hairMatchSet %>%
group_by(siteID,scientificName) %>%
count() %>%
left_join(mapSummary,join_by("siteID"=="siteID")) %>%
spread(scientificName, n)
Create a base map for our data and add pie minicharts showing the species selected at each site, with chart width scaled by the mean CN ratio of canopy foliage.
basemap <- leaflet() %>%
addTiles() %>%
addProviderTiles(providers$CartoDB.PositronNoLabels) %>%
setView(lng = -75, lat = 42, zoom = 5)
speciesByCN <- basemap %>%
addMinicharts(mapSummaryWithSpecies$lng, mapSummaryWithSpecies$lat,
type ='pie',
chartdata = mapSummaryWithSpecies[,10:12],
width = mapSummaryWithSpecies$meanCN.cfc/2)
speciesByCN
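The chart data above are selected by column position (10:12), which silently breaks if the number of species or metadata columns changes. A hedged alternative is to select the species columns by name. The sketch below uses a made-up wide table; the list of metadata column names is an assumption about what the table contains, and replacing NA with 0 assumes that a missing count means no samples of that species were selected at the site.

```r
# Made-up wide table: one row per site, one column per species
toy <- data.frame(
  siteID = c("AAAA", "BBBB"),
  lat = c(42.5, 38.9),
  lng = c(-72.2, -78.1),
  meanCN.cfc = c(33.5, 14.8),
  `Peromyscus leucopus` = c(10, 10),
  `Clethrionomys gapperi` = c(10, NA),
  check.names = FALSE
)

# Everything that is not a known metadata column is treated as a species
meta_cols    <- c("siteID", "lat", "lng", "meanCN.cfc")
species_cols <- setdiff(names(toy), meta_cols)

chart_data <- toy[, species_cols, drop = FALSE]
chart_data[is.na(chart_data)] <- 0  # assumption: NA means zero samples
chart_data
```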
We see that we have a good representation of P. leucopus across our study area. For a strong species-specific analysis we may choose to focus on this species and investigate the CN ratios present at sites where it is present.
PleucSites <- mapSummaryWithSpecies[!is.na(mapSummaryWithSpecies$`Peromyscus leucopus`), ]
ggplot(PleucSites, aes(x = meanCN.ltr, y = meanCN.cfc)) +
geom_point() +
labs(
x = "Litter CN ratio",
y = "Canopy Foliage CN ratio"
) +
theme_minimal()
We see approximately 2-fold variation in both litter and canopy foliage CN ratios across these sites, indicating that a wide range of isotopic environments can be studied.
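The fold variation can be computed directly as the ratio of the largest to the smallest site mean. The sketch below uses the mean CN ratios of the eight selected site by year combinations, copied from the summary tables printed earlier in this tutorial. Note that it covers all eight sites rather than only the P. leucopus subset plotted above, so the values are indicative only.

```r
# Mean CN ratios for the eight selected site-year combinations,
# copied from the summary.cfc and summary.ltr output shown earlier
CN <- data.frame(
  siteYear   = c("BART.2022", "BLAN.2020", "GRSM.2021", "HARV.2018",
                 "MLBS.2023", "ORNL.2022", "SCBI.2022", "SERC.2021"),
  meanCN.cfc = c(36.1, 24.0, 26.1, 33.5, 22.9, 27.5, 14.8, 29.9),
  meanCN.ltr = c(77.2, 39.5, 71.7, 64.8, 62.6, 72.5, 47.0, 72.7)
)

# Fold variation = largest mean divided by smallest mean
fold <- function(x) max(x, na.rm = TRUE) / min(x, na.rm = TRUE)

fold.cfc <- fold(CN$meanCN.cfc)
fold.ltr <- fold(CN$meanCN.ltr)
round(c(canopy = fold.cfc, litter = fold.ltr), 2)
```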
This is just one of many ways to connect NEON data with available organismal and environmental samples in order to develop new research projects. The NEON Biorepository Data Portal allows you to search the fast growing collection of samples based on a variety of criteria, such as taxonomy, collecting events, preservation type, and more. You are encouraged to reach out to biorepo@asu.edu or fill out the NEON Biorepository Contact Form with any inquiries about NEON samples.