Time Series 04: Subset and Manipulate Time Series Data with dplyr

In this tutorial, we will use the group_by, summarize and mutate functions in the dplyr package to efficiently manipulate atmospheric data collected at the NEON Harvard Forest Field Site. We will use pipes to efficiently perform multiple tasks within a single chunk of code.

Learning Objectives

After completing this tutorial, you will be able to:

  • Explain several ways to manipulate data using functions in the dplyr package in R.
  • Use the group_by(), summarize(), and mutate() functions.
  • Write and understand R code with pipes for cleaner, efficient coding.
  • Use the year() function from the lubridate package to extract year from a date-time class variable.

Things You’ll Need To Complete This Tutorial

You will need the most current version of R and, preferably, RStudio loaded on your computer to complete this tutorial.

Install R Packages

  • lubridate: install.packages("lubridate")
  • dplyr: install.packages("dplyr")
  • ggplot2: install.packages("ggplot2")

More on Packages in R – Adapted from Software Carpentry.

Download Data

NEON Teaching Data Subset: Meteorological Data for Harvard Forest

The data used in this lesson were collected at the National Ecological Observatory Network's Harvard Forest field site. These data are proxy data for what will be available for 30 years on the NEON data portal for the Harvard Forest and other field sites located across the United States.

Download Dataset


Set Working Directory: This lesson assumes that you have set your working directory to the location of the downloaded and unzipped data subsets.

An overview of setting the working directory in R can be found here.

R Script & Challenge Code: NEON data lessons often contain challenges that reinforce learned skills. If available, the code for challenge solutions is found in the downloadable R script of the entire lesson, available in the footer of each lesson page.


Additional Resources

  • NEON Data Skills tutorial on spatial data and piping with dplyr
  • Data Carpentry lesson's on Aggregating and Analyzing Data with dplyr
  • dplyr package description.
  • RStudio Introduction to dplyr

Introduction to dplyr

The dplyr package simplifies and speeds up complicated, but commonly performed, data "wrangling" (manipulation / processing) tasks. It takes a data_frame object as input and returns a data_frame as output.

Load the Data

We will need the lubridate and the dplyr packages to complete this tutorial.

We will also use the 15-minute average atmospheric data subsetted to 2009-2011 for the NEON Harvard Forest Field Site. This subset was created in the Subsetting Time Series Data tutorial.

If this object isn't already created, please load the .csv version: Met_HARV_15min_2009_2011.csv. Be sure to convert the date-time column to a POSIXct class after the .csv is loaded.

# it's good coding practice to load packages at the top of a script

library(lubridate) # work with dates
library(dplyr)     # data manipulation (filter, summarize, mutate)
library(ggplot2)   # graphics
library(gridExtra) # tile several plots next to each other
library(scales)

# set working directory to ensure R can find the file we wish to import
wd <- "~/Git/data/"

# 15-min Harvard Forest met data, 2009-2011
harMet15.09.11 <- read.csv(
  file=paste0(wd,"NEON-DS-Met-Time-Series/HARV/FisherTower-Met/Met_HARV_15min_2009_2011.csv"),
  stringsAsFactors = FALSE)

# convert datetime to POSIXct
harMet15.09.11$datetime <- as.POSIXct(harMet15.09.11$datetime,
                                      format = "%Y-%m-%d %H:%M",
                                      tz = "America/New_York")

Explore The Data

Remember that we are interested in the drivers of phenology, including air temperature, precipitation, and PAR (photosynthetically active radiation, or the amount of visible light). Using the 15-minute averaged data, we could easily plot each of these variables.

Daily Meteorological Conditions at Harvard Forest Between 2009 and 2011

However, summarizing the data at a coarser scale (e.g., daily, weekly, by season, or by year) may be easier to visually interpret during initial stages of data exploration.

Extract Year from a Date-Time Column

To summarize by year efficiently, it is helpful to have a year column that we can use to group by. We can use the lubridate function year() to extract the year only from a date-time class R column.

# create a year column
harMet15.09.11$year <- year(harMet15.09.11$datetime)

Using names() we see that we now have a year column in our data_frame.

# check to make sure it worked
names(harMet15.09.11)

##  [1] "X"        "datetime" "jd"       "airt"     "f.airt"   "rh"      
##  [7] "f.rh"     "dewp"     "f.dewp"   "prec"     "f.prec"   "slrr"    
## [13] "f.slrr"   "parr"     "f.parr"   "netr"     "f.netr"   "bar"     
## [19] "f.bar"    "wspd"     "f.wspd"   "wres"     "f.wres"   "wdir"    
## [25] "f.wdir"   "wdev"     "f.wdev"   "gspd"     "f.gspd"   "s10t"    
## [31] "f.s10t"   "year"

str(harMet15.09.11$year)

##  num [1:105108] 2009 2009 2009 2009 2009 ...

Now that we have added a year column to our data_frame, we can use dplyr to summarize our data.

Manipulate Data using dplyr

Let's start by extracting a yearly air temperature value for the Harvard Forest data. To calculate a yearly average, we need to:

  1. Group our data by year.
  2. Calculate the mean air temperature value for each group (i.e., for each year).

We will use dplyr functions group_by and summarize to perform these steps.

# Create a group_by object using the year column 
HARV.grp.year <- group_by(harMet15.09.11, # data_frame object
                          year) # column name to group by

# view class of the grouped object
class(HARV.grp.year)

## [1] "grouped_df" "tbl_df"     "tbl"        "data.frame"

The group_by function creates a "grouped" object. We can then use this grouped object to quickly calculate summary statistics by group - in this case, year. For example, we can count how many measurements were made each year using the tally() function. We can then use the summarize function to calculate the mean air temperature value each year. Note that "summarize" is a common function name across many different packages. If you happen to have two packages loaded at the same time that both have a "summarize" function, you may not get the results that you expect. Therefore, we will "disambiguate" our function by specifying which package it comes from using the double colon notation dplyr::summarize().

# how many measurements were made each year?
tally(HARV.grp.year)

## # A tibble: 3 x 2
##    year     n
##   <dbl> <int>
## 1  2009 35036
## 2  2010 35036
## 3  2011 35036

# what is the mean airt value per year?
dplyr::summarize(HARV.grp.year, 
          mean(airt)   # calculate the annual mean of airt
          ) 

## # A tibble: 3 x 2
##    year `mean(airt)`
##   <dbl>        <dbl>
## 1  2009        NA   
## 2  2010        NA   
## 3  2011         8.75

Why did this return an NA value for 2009 and 2010? We know there are air temperature values for both years. Let's check our data for NoData values.

# are there NoData values?
sum(is.na(HARV.grp.year$airt))

## [1] 2

# where are the no data values
# just view the first 6 columns of data
HARV.grp.year[is.na(HARV.grp.year$airt),1:6]

## # A tibble: 2 x 6
##        X datetime               jd  airt f.airt    rh
##    <int> <dttm>              <int> <dbl> <chr>  <int>
## 1 158360 2009-07-08 14:15:00   189    NA M         NA
## 2 203173 2010-10-18 09:30:00   291    NA M         NA

It appears as if there are two NoData values (in 2009 and 2010) that are causing R to return an NA for the mean for those years. As we learned previously, we can use na.rm = TRUE to tell R to ignore those values when calculating the mean.

# calculate mean but remove NA values
dplyr::summarize(HARV.grp.year, 
          mean(airt, na.rm = TRUE)
          )

## # A tibble: 3 x 2
##    year `mean(airt, na.rm = TRUE)`
##   <dbl>                      <dbl>
## 1  2009                       7.63
## 2  2010                       9.03
## 3  2011                       8.75

Great! We've now used the group_by function to create groups for each year of our data. We then calculated a summary mean value per year using summarize.

dplyr Pipes

The workflow above used several lines of R code and created an intermediate R object - HARV.grp.year. We can combine these steps using pipes from the dplyr package.

We can use pipes to string functions or processing steps together. The output of each step is fed directly into the next step using the syntax: %>%. This means we don't need to save the output of each intermediate step as a new R object.

A few notes about piping syntax:

  • A pipe begins with the object name that we will be manipulating, in our case harMet15.09.11.
  • It then links that object to the first manipulation step using %>%.
  • Finally, the first function is called, in our case group_by(year). Note that because we specified the object name in step one above, we can just use the column name.

So, we have: harMet15.09.11 %>% group_by(year)

  • We can then add an additional function (or more functions!) to our pipe. For example, we can tell R to tally or count the number of measurements per year.

harMet15.09.11 %>% group_by(year) %>% tally()

Let's try it!

# how many measurements were made a year?
harMet15.09.11 %>% 
  group_by(year) %>%  # group by year
  tally() # count measurements per year

## # A tibble: 3 x 2
##    year     n
##   <dbl> <int>
## 1  2009 35036
## 2  2010 35036
## 3  2011 35036

Piping allows us to efficiently perform operations on our data_frame in that:

  1. It allows us to condense our code, without naming intermediate steps.
  2. The dplyr package is optimized to ensure fast processing!

We can use pipes to summarize data by year too:

# what was the annual air temperature average 
year.sum <- harMet15.09.11 %>% 
  group_by(year) %>%  # group by year
  dplyr::summarize(mean(airt, na.rm=TRUE))

# what is the class of the output?
year.sum

## # A tibble: 3 x 2
##    year `mean(airt, na.rm = TRUE)`
##   <dbl>                      <dbl>
## 1  2009                       7.63
## 2  2010                       9.03
## 3  2011                       8.75

# view structure of output
str(year.sum)

## tibble [3 × 2] (S3: tbl_df/tbl/data.frame)
##  $ year                    : num [1:3] 2009 2010 2011
##  $ mean(airt, na.rm = TRUE): num [1:3] 7.63 9.03 8.75
### Challenge: Using Pipes

Use piping to create a `data_frame` called `jday.avg` that contains the average `airt` per Julian day (`harMet15.09.11$jd`). Plot the output using `qplot`.
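One possible approach is sketched below (the full solution is in the lesson's downloadable R script). The sketch uses ggplot() rather than qplot(), which is deprecated in recent versions of ggplot2, and mean_airt is simply a column name chosen for this example.

# one possible approach: average airt for each Julian day across 2009-2011
jday.avg <- harMet15.09.11 %>%
  group_by(jd) %>%   # group by Julian day only
  dplyr::summarize(mean_airt = mean(airt, na.rm = TRUE))

# quick plot of the output
ggplot(jday.avg, aes(x = jd, y = mean_airt)) +
  geom_point() +
  ggtitle("Average Air Temperature by Julian Day\nNEON Harvard Forest, 2009-2011") +
  xlab("Julian Day") + ylab("Mean Air Temperature (C)")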

Average Temperature by Julian Date at Harvard Forest Between 2009 and 2011

**Data Tip:** Older `dplyr` versions used the `%.%` syntax to designate a pipe. Pipes are sometimes referred to as chains.

Other dplyr Functions

dplyr works based on a series of verb functions that allow us to manipulate the data in different ways:

  • filter() & slice(): filter rows based on values in specified columns
  • group_by(): group all data by a column
  • arrange(): sort data by values in specified columns
  • select() & rename(): view and work with data from only specified columns
  • distinct(): view and work with only unique values from specified columns
  • mutate() & transmute(): add new data to a data frame
  • summarise(): calculate the specified summary statistics
  • sample_n() & sample_frac(): return a random sample of rows

(List modified from the CRAN dplyr package description.)

The syntax for all dplyr functions is similar:

  • the first argument is the target data_frame,
  • the subsequent arguments dictate what to do with that data_frame and
  • the output is a new data frame.
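For example, here are a few of these verbs applied to our harMet15.09.11 data_frame (a quick illustration only; the temperature threshold and column choices are arbitrary):

# filter(): keep only rows that meet a condition
warm.periods <- filter(harMet15.09.11, airt > 25)

# select(): keep only the named columns
temp.only <- select(harMet15.09.11, datetime, airt)

# arrange(): sort rows by a column (desc() gives descending order)
warmest.first <- arrange(temp.only, desc(airt))

# each verb takes a data_frame first and returns a new data frame
head(warmest.first)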

Group by a Variable (or Two)

Our goal for this tutorial is to view drivers of annual phenology patterns. Specifically, we want to explore daily average temperature throughout the year. This means we need to calculate average temperature, for each day, across three years. To do this we can use the group_by() function as we did earlier. However this time, we can group by two variables: year and Julian Day (jd) as follows:

group_by(year, jd)

Let's begin by counting or tallying the total measurements per Julian day (or year day) using the group_by() function and pipes.

harMet15.09.11 %>%         # use the harMet15.09.11 data_frame
  group_by(year, jd) %>%   # group data by Year & Julian day
  tally()                  # tally (count) observations per jd / year

## # A tibble: 1,096 x 3
## # Groups:   year [3]
##     year    jd     n
##    <dbl> <int> <int>
##  1  2009     1    96
##  2  2009     2    96
##  3  2009     3    96
##  4  2009     4    96
##  5  2009     5    96
##  6  2009     6    96
##  7  2009     7    96
##  8  2009     8    96
##  9  2009     9    96
## 10  2009    10    96
## # … with 1,086 more rows

The output shows we have 96 values for each day. Is that what we expect?

24*4  # 24 hours/day * 4 15-min data points/hour

## [1] 96
**Data Tip:** If Julian days weren't already in our data, we could use the `yday()` function from the `lubridate` package to create a new column containing Julian day values. More information in this NEON Data Skills tutorial.

Summarize by Group

We can use the summarize() function to calculate a summary output value for each "group" - in this case Julian day per year. Let's calculate the mean air temperature for each Julian day per year. Note that we are still using na.rm=TRUE to tell R to skip NA values.

harMet15.09.11 %>%         # use the harMet15.09.11 data_frame
  group_by(year, jd) %>%   # group data by Year & Julian day
  dplyr::summarize(mean_airt = mean(airt, na.rm = TRUE))  # mean airtemp per jd / year

## `summarise()` has grouped output by 'year'. You can override using the `.groups` argument.

## # A tibble: 1,096 x 3
## # Groups:   year [3]
##     year    jd mean_airt
##    <dbl> <int>     <dbl>
##  1  2009     1    -15.1 
##  2  2009     2     -9.14
##  3  2009     3     -5.54
##  4  2009     4     -6.35
##  5  2009     5     -2.41
##  6  2009     6     -4.92
##  7  2009     7     -2.59
##  8  2009     8     -3.23
##  9  2009     9     -9.92
## 10  2009    10    -11.1 
## # … with 1,086 more rows
### Challenge: Summarization & Calculations with dplyr

We can use `sum` to calculate the total rather than mean value for each Julian Day. Using this information, do the following:

  1. Calculate total prec for each Julian Day over the 3 years - name your data_frame total.prec.
  2. Create a plot of Daily Total Precipitation for 2009-2011.
  3. Add a title and x and y axis labels.
  4. If you use qplot to create your plot, use colour=as.factor(total.prec$year) to color the data points by year.
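One way to approach this challenge is sketched below (the full solution is in the lesson's downloadable R script). The sketch groups by both year and Julian day so that points can be colored by year, and uses ggplot() in place of qplot(); total_prec is a column name chosen for this example.

# one possible approach: total precipitation per Julian day, per year
total.prec <- harMet15.09.11 %>%
  group_by(year, jd) %>%
  dplyr::summarize(total_prec = sum(prec, na.rm = TRUE))

# plot daily total precipitation, colored by year
ggplot(total.prec, aes(x = jd, y = total_prec, colour = as.factor(year))) +
  geom_point() +
  ggtitle("Daily Total Precipitation\nNEON Harvard Forest, 2009-2011") +
  xlab("Julian Day") + ylab("Total Precipitation (mm)")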

Average Daily Precipitation (mm) at Harvard Forest by Julian Date for the time period 2009 - 2011

Mutate - Add data_frame Columns to dplyr Output

We can use the mutate() function of dplyr to add additional columns of information to a data_frame. For instance, we added the year column independently at the very beginning of the tutorial. However, we can add the year using a dplyr pipe that also summarizes our data. To do this, we would use the syntax:

mutate(year2 = year(datetime))

year2 is the name of the column that will be added to the output dplyr data_frame.

harMet15.09.11 %>%
  mutate(year2 = year(datetime)) %>%
  group_by(year2, jd) %>%
  dplyr::summarize(mean_airt = mean(airt, na.rm = TRUE))

## `summarise()` has grouped output by 'year2'. You can override using the `.groups` argument.

## # A tibble: 1,096 x 3
## # Groups:   year2 [3]
##    year2    jd mean_airt
##    <dbl> <int>     <dbl>
##  1  2009     1    -15.1 
##  2  2009     2     -9.14
##  3  2009     3     -5.54
##  4  2009     4     -6.35
##  5  2009     5     -2.41
##  6  2009     6     -4.92
##  7  2009     7     -2.59
##  8  2009     8     -3.23
##  9  2009     9     -9.92
## 10  2009    10    -11.1 
## # … with 1,086 more rows
**Data Tip:** The `mutate` function is similar to `transform()` in base R. However, `mutate()` allows us to create and immediately use the variable (`year2`).

Save dplyr Output as data_frame

We can save output from a dplyr pipe as a new R object to use for quick plotting.

harTemp.daily.09.11<-harMet15.09.11 %>%
                    mutate(year2 = year(datetime)) %>%
                    group_by(year2, jd) %>%
                    dplyr::summarize(mean_airt = mean(airt, na.rm = TRUE))

## `summarise()` has grouped output by 'year2'. You can override using the `.groups` argument.

head(harTemp.daily.09.11)

## # A tibble: 6 x 3
## # Groups:   year2 [1]
##   year2    jd mean_airt
##   <dbl> <int>     <dbl>
## 1  2009     1    -15.1 
## 2  2009     2     -9.14
## 3  2009     3     -5.54
## 4  2009     4     -6.35
## 5  2009     5     -2.41
## 6  2009     6     -4.92

Add Date-Time To dplyr Output

In the challenge above, we created a plot of daily precipitation data using qplot. However, the x-axis ranged from 0-366 (Julian Days for the year). It would have been easier to create a meaningful plot across all three years if we had a continuous date variable in our data_frame representing the year and date for each summary value.

We can add the datetime column value to our data_frame by adding an additional argument to the summarize() function. In this case, we will add the first datetime value that R encounters when summarizing data by group as follows:

datetime = first(datetime)

Our new summarize statement in our pipe will look like this:

summarize(mean_airt = mean(airt, na.rm = TRUE), datetime = first(datetime))

Let's try it!

# add in a datetime column
harTemp.daily.09.11 <- harMet15.09.11 %>%
  mutate(year3 = year(datetime)) %>%
  group_by(year3, jd) %>%
  dplyr::summarize(mean_airt = mean(airt, na.rm = TRUE), datetime = first(datetime))

## `summarise()` has grouped output by 'year3'. You can override using the `.groups` argument.

# view str and head of data
str(harTemp.daily.09.11)

## tibble [1,096 × 4] (S3: grouped_df/tbl_df/tbl/data.frame)
##  $ year3    : num [1:1096] 2009 2009 2009 2009 2009 ...
##  $ jd       : int [1:1096] 1 2 3 4 5 6 7 8 9 10 ...
##  $ mean_airt: num [1:1096] -15.13 -9.14 -5.54 -6.35 -2.41 ...
##  $ datetime : POSIXct[1:1096], format: "2009-01-01 00:15:00" ...
##  - attr(*, "groups")= tibble [3 × 2] (S3: tbl_df/tbl/data.frame)
##   ..$ year3: num [1:3] 2009 2010 2011
##   ..$ .rows: list<int> [1:3] 
##   .. ..$ : int [1:366] 1 2 3 4 5 6 7 8 9 10 ...
##   .. ..$ : int [1:365] 367 368 369 370 371 372 373 374 375 376 ...
##   .. ..$ : int [1:365] 732 733 734 735 736 737 738 739 740 741 ...
##   .. ..@ ptype: int(0) 
##   ..- attr(*, ".drop")= logi TRUE

head(harTemp.daily.09.11)

## # A tibble: 6 x 4
## # Groups:   year3 [1]
##   year3    jd mean_airt datetime           
##   <dbl> <int>     <dbl> <dttm>             
## 1  2009     1    -15.1  2009-01-01 00:15:00
## 2  2009     2     -9.14 2009-01-02 00:15:00
## 3  2009     3     -5.54 2009-01-03 00:15:00
## 4  2009     4     -6.35 2009-01-04 00:15:00
## 5  2009     5     -2.41 2009-01-05 00:15:00
## 6  2009     6     -4.92 2009-01-06 00:15:00
### Challenge: Combined dplyr Skills
  1. Plot daily total precipitation from 2009-2011 as we did in the previous challenge. However this time, use the new syntax that you learned (mutate and summarize to add a datetime column to your output data_frame).
  2. Create a data_frame of the average monthly air temperature for 2009-2011. Name the new data frame harTemp.monthly.09.11. Plot your output.
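A sketch of one possible approach to the second item is shown below (again using ggplot() rather than qplot()). The object and mean_airt names follow the prompt; month() is the lubridate helper that extracts the month number, and first(datetime) supplies a plotting date for each month.

# one possible approach: monthly mean air temperature, 2009-2011
harTemp.monthly.09.11 <- harMet15.09.11 %>%
  mutate(month = month(datetime)) %>%
  group_by(year, month) %>%
  dplyr::summarize(mean_airt = mean(airt, na.rm = TRUE),
                   datetime = first(datetime))

# plot monthly means against the first datetime in each month
ggplot(harTemp.monthly.09.11, aes(x = datetime, y = mean_airt)) +
  geom_point() +
  ggtitle("Monthly Mean Air Temperature\nNEON Harvard Forest, 2009-2011") +
  xlab("Date") + ylab("Mean Air Temperature (C)")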

Daily Precipitation at Harvard Forest Between 2009 and 2011


Monthly Mean Temperature at Harvard Forest Between 2009 and 2011

Time Series 03: Cleaning & Subsetting Time Series Data in R - NoData Values & Subset by Date

This tutorial explores how to deal with NoData values encountered in a time series dataset, in R. It also covers how to subset large files by date and export the results to a .csv (text) file.

Learning Objectives

After completing this tutorial, you will be able to:

  • Subset data by date.
  • Search for NA or missing data values.
  • Describe different possibilities on how to deal with missing data.

Things You’ll Need To Complete This Tutorial

You will need the most current version of R and, preferably, RStudio loaded on your computer to complete this tutorial.

Install R Packages

  • lubridate: install.packages("lubridate")
  • ggplot2: install.packages("ggplot2")

More on Packages in R – Adapted from Software Carpentry.

Download Data

NEON Teaching Data Subset: Meteorological Data for Harvard Forest

The data used in this lesson were collected at the National Ecological Observatory Network's Harvard Forest field site. These data are proxy data for what will be available for 30 years on the NEON data portal for the Harvard Forest and other field sites located across the United States.

Download Dataset


Set Working Directory: This lesson assumes that you have set your working directory to the location of the downloaded and unzipped data subsets.

An overview of setting the working directory in R can be found here.

R Script & Challenge Code: NEON data lessons often contain challenges that reinforce learned skills. If available, the code for challenge solutions is found in the downloadable R script of the entire lesson, available in the footer of each lesson page.

Cleaning Time Series Data

It is common to encounter large files containing more data than we need for our analysis. It is also common to encounter NoData values that we need to account for when analyzing our data.

In this tutorial, we'll learn how to both manage NoData values and also subset and export a portion of an R object as a new .csv file.

In this tutorial, we will work with atmospheric data, containing air temperature, precipitation, and photosynthetically active radiation (PAR) data - metrics that are key drivers of phenology. Our study area is the NEON Harvard Forest Field Site.

Import Timeseries Data

We will use the lubridate and ggplot2 packages. Let's load those first.

If you have not already done so, import the hf001-10-15min-m.csv file, which contains atmospheric data for Harvard Forest. Convert the datetime column to a POSIXct class as covered in the tutorial: Dealing With Dates & Times in R - as.Date, POSIXct, POSIXlt.

# Load packages required for entire script
library(lubridate)  # work with dates
library(ggplot2)  # plotting

# set working directory to ensure R can find the file we wish to import
wd <- "~/Git/data/"

# Load csv file containing 15 minute averaged atmospheric data 
# for the NEON Harvard Forest Field Site

# Factors=FALSE so data are imported as numbers and characters 
harMet_15Min <- read.csv(
  file=paste0(wd,"NEON-DS-Met-Time-Series/HARV/FisherTower-Met/hf001-10-15min-m.csv"),
  stringsAsFactors = FALSE)

# convert to POSIX date time class - US Eastern Time Zone
harMet_15Min$datetime <- as.POSIXct(harMet_15Min$datetime,
                                format = "%Y-%m-%dT%H:%M",
                                tz = "America/New_York")

Subset by Date

Our .csv file contains nearly a decade's worth of data which makes for a large file. The time period we are interested in for our study is:

  • Start Time: 1 January 2009
  • End Time: 31 Dec 2011

Let's subset the data to only contain these three years. We can use the subset() function, with the syntax: NewObject <- subset(ObjectToBeSubset, CriteriaForSubsetting).

We will set our criteria to be any datetime that:

  1. Is greater than or equal to 1 Jan 2009 at 0:00 AND
  2. Is less than or equal to 31 Dec 2011 at 23:59.

We also need to specify the time zone so R can handle daylight saving time and leap years.

# subset data - 2009-2011
harMet15.09.11 <- subset(harMet_15Min,
                         datetime >= as.POSIXct('2009-01-01 00:00',
                                                tz = "America/New_York") &
                         datetime <= as.POSIXct('2011-12-31 23:59',
                                               tz = "America/New_York"))

# View first and last records in the object 
head(harMet15.09.11[1])

##                   datetime
## 140255 2009-01-01 00:00:00
## 140256 2009-01-01 00:15:00
## 140257 2009-01-01 00:30:00
## 140258 2009-01-01 00:45:00
## 140259 2009-01-01 01:00:00
## 140260 2009-01-01 01:15:00

tail(harMet15.09.11[1])

##                   datetime
## 245369 2011-12-31 22:30:00
## 245370 2011-12-31 22:45:00
## 245371 2011-12-31 23:00:00
## 245372 2011-12-31 23:15:00
## 245373 2011-12-31 23:30:00
## 245374 2011-12-31 23:45:00

It worked! The first entry is 1 January 2009 at 00:00 and the last entry is 31 December 2011 at 23:45.

Export data.frame to .CSV

We can export this subset in .csv format to use in other analyses or to share with colleagues using write.csv.

**Data Tip:** Remember to give your output files concise yet descriptive names so you can identify what they contain in the future. By default, the `.csv` file will be written to your working directory.

# write harMet15 subset data to .csv
write.csv(harMet15.09.11, 
          file=paste0(wd,"Met_HARV_15min_2009_2011.csv"))
### Challenge: Subset & Plot Data
  1. Create a plot of precipitation for the month of July 2010 in Harvard Forest. Be sure to label x and y axes. Also be sure to give your plot a title.

  2. Create a plot of dew point (dewp) for the year 2011 at Harvard Forest.

Bonus challenge: Complete this challenge using the available daily data instead of the 15-minute data. What will need to change in your subsetting code?
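A sketch of one way to tackle the first item, using the subset() syntax from above and ggplot2 (already loaded); the title and axis labels are placeholders:

# subset the 15-minute data to July 2010
harMetJul2010 <- subset(harMet15.09.11,
                        datetime >= as.POSIXct('2010-07-01 00:00',
                                               tz = "America/New_York") &
                        datetime <= as.POSIXct('2010-07-31 23:59',
                                               tz = "America/New_York"))

# plot precipitation for July 2010
ggplot(harMetJul2010, aes(x = datetime, y = prec)) +
  geom_point() +
  ggtitle("Precipitation at Harvard Forest, July 2010") +
  xlab("Date") + ylab("Precipitation (mm)")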

Daily Rainfall at Harvard Forest for the month of July, 2010

Daily Dewpoint at Harvard Forest for the month of July, 2011

Managing Missing Data: NoData values

Find NoData Values

If we are lucky when working with external data, the NoData value is clearly specified in the metadata. NoData values can be stored in different ways:

  • NA / NaN: Sometimes this value is NA or NaN (not a number).
  • A Designated Numeric Value (e.g., -9999): Character strings such as NA cannot always be stored alongside numeric values in some file formats. Sometimes you'll encounter numeric placeholders for NoData values such as -9999 (a value often used in the GIS / Remote Sensing and Micrometeorology domains).
  • Blank Values: Sometimes NoData values are left blank. Blanks are particularly problematic because we can't be certain if a data value is purposefully missing (not measured that day or a bad measurement) or if someone unintentionally deleted it.

Because the actual value used to designate missing data can vary depending upon what data we are working with, it is important to always check the metadata for the file's associated NoData value. If the value is NA, we are in luck: R will recognize and flag this value as NoData. If the value is numeric (e.g., -9999), then we might need to assign this value to NA.

**Data Tip:** `NA` values will be ignored when performing calculations in R. However a `NoData` value of `-9999` will be recognized as an integer and processed accordingly. If you encounter a numeric `NoData` value be sure to assign it to `NA` in R: `objectName[objectName==-9999] <- NA`

In the Why Metadata Are Important: How to Work with Metadata in Text & EML Format tutorial, we viewed the metadata for these data and discovered that missing values are designated using NA - a common NoData value placeholder.

Excerpt from the metadata: airt: average air temperature. Average of daily averages. (unit: celsius / missing value: NA)

Check For NoData Values

We can quickly check for NoData values in our data using the is.na() function. By asking for the sum() of is.na() we can see how many NA / missing values we have.

# Check for NA values
sum(is.na(harMet15.09.11$datetime))

## [1] 0

sum(is.na(harMet15.09.11$airt))

## [1] 2

# view rows where the air temperature is NA 
harMet15.09.11[is.na(harMet15.09.11$airt),]

##                   datetime  jd airt f.airt rh f.rh dewp f.dewp prec
## 158360 2009-07-08 14:15:00 189   NA      M NA    M   NA      M    0
## 203173 2010-10-18 09:30:00 291   NA      M NA    M   NA      M    0
##        f.prec slrr f.slrr parr f.parr netr f.netr bar f.bar wspd
## 158360         290         485         139         NA     M  2.1
## 203173          NA      M   NA      M   NA      M  NA     M   NA
##        f.wspd wres f.wres wdir f.wdir wdev f.wdev gspd f.gspd s10t
## 158360         1.8          86          29         5.2        20.7
## 203173      M   NA      M   NA      M   NA      M   NA      M 10.9
##        f.s10t
## 158360       
## 203173

The results above tell us there are NoData values in the airt column. However, there are also NoData values in other variables.

### Challenge: NoData Values

How many NoData values are in the precipitation (prec) and PAR (parr) columns of our data?
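The same sum(is.na()) pattern used above answers this; for example:

# count NoData values in the precipitation and PAR columns
sum(is.na(harMet15.09.11$prec))
sum(is.na(harMet15.09.11$parr))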

Deal with NoData Values

When we encounter NoData values (blank, NaN, -9999, etc.) in our data we need to decide how to deal with them. By default R treats NoData values designated with a NA as a missing value rather than a zero. This is good, as a value of zero (no rain today) is not the same as missing data (e.g. we didn't measure the amount of rainfall today).

How we deal with NoData values will depend on:

  • the data type we are working with
  • the analysis we are conducting
  • the significance of the gap or missing value

Many functions in R contain an na.rm= option that allows you to tell R to ignore NA values in your data when performing calculations.

To Gap Fill? Or Not?

Sometimes we might need to "gap fill" our data. This means we interpolate or estimate missing values, often using statistical methods. Gap filling can be complex and is beyond the scope of this tutorial. The takeaway is simply that it is important to acknowledge missing values in your data and to carefully consider how you wish to account for them during analysis.

Other resources:

  1. R code for dealing with missing data: Quick-R: Missing Data

Managing NoData Values in Our Data

For this tutorial, we are exploring the patterns of precipitation and temperature as they relate to green-up and brown-down of vegetation at Harvard Forest. To view overall trends during these early exploration stages, it is okay for us to leave out the NoData values in our plots.

**Data Tip:** If we wanted to perform more advanced statistical analysis, we might consider gap filling as our next step. Many data products from tower networks such as FluxNet include a higher-level, gap-filled product that we can download. More on Gap Filling

NoData Values Can Impact Calculations

It is important to consider NoData values when performing calculations on our data. For example, R will not properly calculate certain functions if there are NA values in the data, unless we explicitly tell it to ignore them.

# calculate mean of air temperature
mean(harMet15.09.11$airt)

## [1] NA

# are there NA values in our data?
sum(is.na(harMet15.09.11$airt))

## [1] 2

R will not return a value for the mean because there are NA values in the air temperature column. Because there are only 2 missing values (out of 105,108) for air temperature, we aren't that worried about a skewed 3-year mean. We can tell R to ignore NoData values in the mean calculation using na.rm= (NA remove).

# calculate mean of air temperature, ignore NA values
mean(harMet15.09.11$airt, 
     na.rm=TRUE)

## [1] 8.467904

We now see that the 3-year average air temperature is 8.5°C.

### Challenge: Import, Understand Metadata, and Clean a Data Set

We have been using the 15-minute data from the Harvard Forest. However, overall we are interested in larger scale patterns of greening-up and browning-down. Thus a daily summary is sufficient for us to see overall trends.

  1. Import the Daily Meteorological data from the Harvard Forest (if you haven't already done so in the Intro to Time Series Data in R tutorial.)
  2. Check the metadata to see what the column names are for the variables of interest (precipitation, air temperature, PAR, day and time).
  3. If needed, convert the data class of different columns.
  4. Check for missing data and decide what to do with any that exist.
  5. Subset out the data for the duration of our study: 2009-2011. Name the object "harMetDaily.09.11".
  6. Export the subset to a .csv file.
  7. Create a plot of Daily Air Temperature for 2009-2011. Be sure to label, x- and y-axes. Also give the plot a title!
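A condensed sketch of one possible workflow is shown below. It assumes the daily file is hf001-06-daily-m.csv (the daily product listed on the Harvard Forest data page) with date and airt columns; check the metadata to confirm the column names, and see the lesson's downloadable R script for the full solution. Met_HARV_Daily_2009_2011.csv is just an example output name.

# 1. import the daily data (file and column names per the Fisher tower metadata)
harMetDaily <- read.csv(
  file=paste0(wd,"NEON-DS-Met-Time-Series/HARV/FisherTower-Met/hf001-06-daily-m.csv"),
  stringsAsFactors = FALSE)

# 3. convert the date column to a Date class
harMetDaily$date <- as.Date(harMetDaily$date, format = "%Y-%m-%d")

# 4. check for missing air temperature values
sum(is.na(harMetDaily$airt))

# 5. subset to our study period, 2009-2011
harMetDaily.09.11 <- subset(harMetDaily,
                            date >= as.Date('2009-01-01') &
                            date <= as.Date('2011-12-31'))

# 6. export the subset to .csv
write.csv(harMetDaily.09.11,
          file=paste0(wd,"Met_HARV_Daily_2009_2011.csv"))

# 7. plot daily air temperature
ggplot(harMetDaily.09.11, aes(x = date, y = airt)) +
  geom_point() +
  ggtitle("Daily Air Temperature\nNEON Harvard Forest, 2009-2011") +
  xlab("Date") + ylab("Mean Daily Air Temperature (C)")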

Relationship between Date and Daily Average Temperature at Harvard Forest between 2009 and 2012

Time Series 02: Dealing With Dates & Times in R - as.Date, POSIXct, POSIXlt

This tutorial explores working with date and time fields in R. We will review the differences between as.Date, POSIXct, and POSIXlt as used to convert a date / time field in character (string) format to a date-time format that is recognized by R. This conversion supports efficient plotting, subsetting, and analysis of time series data.

Learning Objectives

After completing this tutorial, you will be able to:

  • Describe various date-time classes and data structures in R.
  • Explain the difference between the POSIXct and POSIXlt data classes and why POSIXct may be preferred for some tasks.
  • Convert a column containing date-time information in character format to a date-time R class.
  • Convert a date-time column to different date-time classes.
  • Write out a date-time class object in different ways (month-day, month-day-year, etc).

Things You’ll Need To Complete This Tutorial

You will need the most current version of R and, preferably, RStudio loaded on your computer to complete this tutorial.

Install R Packages

  • lubridate: install.packages("lubridate")

More on Packages in R – Adapted from Software Carpentry.

Download Data

NEON Teaching Data Subset: Meteorological Data for Harvard Forest

The data used in this lesson were collected at the National Ecological Observatory Network's Harvard Forest field site. These data are proxy data for what will be available for 30 years on the NEON data portal for the Harvard Forest and other field sites located across the United States.

Download Dataset


Set Working Directory: This lesson assumes that you have set your working directory to the location of the downloaded and unzipped data subsets.

An overview of setting the working directory in R can be found here.

R Script & Challenge Code: NEON data lessons often contain challenges that reinforce learned skills. If available, the code for challenge solutions is found in the downloadable R script of the entire lesson, available in the footer of each lesson page.

The Data Approach

In the Intro to Time Series Data in R tutorial, we imported a time series dataset in .csv format into R. We learned how to quickly plot these data by converting the date column to an R Date class. In this tutorial we will explore how to work with a column that contains both a date AND a time stamp.

We will use functions from both base R and the lubridate package to work with date-time data classes.

# Load packages required for entire script
library(lubridate)  #work with dates

#Set the working directory and place your downloaded data there
wd <- "~/Git/data/"

Import CSV File

First, let's import our time series data. We are interested in temperature, precipitation and photosynthetically active radiation (PAR) - metrics that are strongly associated with vegetation green-up and brown down (phenology or phenophase timing). We will use the hf001-10-15min-m.csv file that contains atmospheric data for the NEON Harvard Forest field site, aggregated at 15-minute intervals. Download the dataset for these exercises here.

# Load csv file of 15 min meteorological data from Harvard Forest
# contained within the downloaded directory, or available for download
# directly from:
# https://harvardforest.fas.harvard.edu/data/p00/hf001/hf001-10-15min-m.csv
# Factors=FALSE so strings, series of letters/words/numerals, remain characters
harMet_15Min <- read.csv(
  file=paste0(wd,"NEON-DS-Met-Time-Series/HARV/FisherTower-Met/hf001-10-15min-m.csv"),
  stringsAsFactors = FALSE)

Date and Time Data

Let's revisit the data structure of our harMet_15Min object. What is the class of the date-time column?

# view column data class
class(harMet_15Min$datetime)

## [1] "character"

# view sample data
head(harMet_15Min$datetime)

## [1] "2005-01-01T00:15" "2005-01-01T00:30" "2005-01-01T00:45"
## [4] "2005-01-01T01:00" "2005-01-01T01:15" "2005-01-01T01:30"

Our datetime column is stored as a character class. We need to convert it to a date-time class. What happens when we use the as.Date method that we learned about in the Intro to Time Series Data in R tutorial?

# convert column to date class
dateOnly_HARV <- as.Date(harMet_15Min$datetime)

# view data
head(dateOnly_HARV)

## [1] "2005-01-01" "2005-01-01" "2005-01-01" "2005-01-01" "2005-01-01"
## [6] "2005-01-01"

When we use as.Date, we lose the time stamp.

Explore Date and Time Classes

R - Date Class - as.Date

As we just saw, the as.Date format doesn't store any time information. When we use the as.Date method to convert a date stored as a character class to an R date class, it will ignore all values after the date string.

# Convert character data to date (no time) 
myDate <- as.Date("2015-10-19 10:15")   
str(myDate)

##  Date[1:1], format: "2015-10-19"

# what happens if the date has text at the end?
myDate2 <- as.Date("2015-10-19Hello")   
str(myDate2)

##  Date[1:1], format: "2015-10-19"

As we can see above the as.Date() function will convert the characters that it recognizes to be part of a date into a date class and ignore all other characters in the string.

R - Date-Time - The POSIX classes

If we have a column containing both date and time we need to use a class that stores both date AND time. Base R offers two closely related classes for date and time: POSIXct and POSIXlt.

POSIXct

The as.POSIXct method converts a date-time string into a POSIXct class.

# Convert character data to date and time.
timeDate <- as.POSIXct("2015-10-19 10:15")   
str(timeDate)

##  POSIXct[1:1], format: "2015-10-19 10:15:00"

timeDate

## [1] "2015-10-19 10:15:00 MDT"

as.POSIXct stores both a date and time with an associated time zone. The default time zone is the one your computer is set to, which is most often your local time zone.

POSIXct stores date and time in seconds, with the count of seconds beginning at 1 January 1970. Negative numbers are used to store dates prior to 1970. Thus, the POSIXct format stores each date and time as a single value in units of seconds. Storing the data this way optimizes use in data.frames and speeds up computation, processing, and conversion to other formats.

# to see the data in this 'raw' format, i.e., not formatted according to the 
# class type to show us a date we recognize, use the `unclass()` function.
unclass(timeDate)

## [1] 1445271300
## attr(,"tzone")
## [1] ""

Here we see the number of seconds from 1 January 1970 to 19 October 2015 (1445271300). Also, we see that a time zone (tzone) is associated with the value in seconds.

**Data Tip:** The `unclass` method in R allows you to view how a particular R object is stored.

POSIXlt

The POSIXct format is optimized for storage and computation. However, humans aren't quite as computationally efficient as computers! Also, we often want to quickly extract some portion of the data (e.g., months). The POSIXlt class stores date and time information in a format that we are used to seeing (e.g., second, min, hour, day of month, month, year, numeric day of year, etc).

# Convert character data to POSIXlt date and time
timeDatelt<- as.POSIXlt("2015-10-19 10:15")  
str(timeDatelt)

##  POSIXlt[1:1], format: "2015-10-19 10:15:00"

timeDatelt

## [1] "2015-10-19 10:15:00 MDT"

unclass(timeDatelt)

## $sec
## [1] 0
## 
## $min
## [1] 15
## 
## $hour
## [1] 10
## 
## $mday
## [1] 19
## 
## $mon
## [1] 9
## 
## $year
## [1] 115
## 
## $wday
## [1] 1
## 
## $yday
## [1] 291
## 
## $isdst
## [1] 1
## 
## $zone
## [1] "MDT"
## 
## $gmtoff
## [1] NA

When we convert a string to POSIXlt, and view it in R, it still looks similar to the POSIXct format. However, unclass() shows us that the data are stored differently. The POSIXlt class stores the hour, minute, second, day, month, and year separately.

Months in POSIXlt

POSIXlt has a few quirks. First, the month values stored in the POSIXlt object use zero-based indexing. This means that month #1 (January) is stored as 0, and month #2 (February) is stored as 1. Notice in the output above, October is stored as the 9th month ($mon = 9).

Years in POSIXlt

Years are also stored differently in the POSIXlt class. Year values are stored using a base index value of 1900. Thus, 2015 is stored as 115 ($year = 115; that is, 115 years since 1900).
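Because POSIXlt stores these components in a named list, we can pull them out directly with $ and apply the offsets ourselves (using the timeDatelt object created above):

# month is zero-based: 9 corresponds to October
timeDatelt$mon

## [1] 9

# year is stored as years since 1900
timeDatelt$year + 1900

## [1] 2015
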
**Data Tip:** To learn more about how R stores information within date-time and other objects check out the R documentation by using `?` (e.g., `?POSIXlt`). NOTE: you can use this same syntax to learn about particular functions (e.g., `?ggplot`).

Having dates classified as separate components makes POSIXlt computationally more resource intensive to use in data.frames. This slows things down! We will thus use POSIXct for this tutorial.

**Data Tip:** There are other R packages that support date-time data classes, including `readr`, `zoo` and `chron`.

Convert to Date-time Class

When we convert from a character to a date-time class we need to tell R how the date and time information are stored in each string. To do this, we can use format=. Let's have a look at one of our date-time strings to determine its format.

# view one date-time field
harMet_15Min$datetime[1]

## [1] "2005-01-01T00:15"

Looking at the results above, we see that our data are stored in the format: Year-Month-Day "T" Hour:Minute (2005-01-01T00:15). We can use this information to populate our format string using the following designations for the components of the date-time data:

  • %Y - year
  • %m - month
  • %d - day
  • %H:%M - hours:minutes
**Data Tip:** A list of options for date-time format is available in the `strptime` function help: `help(strptime)`. Check it out!

The "T" inserted into the middle of datetime isn't a standard character for date and time, however we can tell R where the character is so it can ignore it and interpret the correct date and time as follows: format="%Y-%m-%dT%H:%M".

# convert single instance of date/time in format year-month-day hour:min:sec
as.POSIXct(harMet_15Min$datetime[1],format="%Y-%m-%dT%H:%M")

## [1] "2005-01-01 00:15:00 MST"

## The format of date-time MUST match the specified format or the data will not
# convert; see what happens when you try it a different way or without the "T"
# specified
as.POSIXct(harMet_15Min$datetime[1],format="%d-%m-%Y%H:%M")

## [1] NA

as.POSIXct(harMet_15Min$datetime[1],format="%Y-%m-%d%H:%M")

## [1] NA

Using the syntax we've learned, we can convert the entire datetime column into POSIXct class.

new.date.time <- as.POSIXct(harMet_15Min$datetime,
                            format="%Y-%m-%dT%H:%M" #format time
                            )

# view output
head(new.date.time)

## [1] "2005-01-01 00:15:00 MST" "2005-01-01 00:30:00 MST"
## [3] "2005-01-01 00:45:00 MST" "2005-01-01 01:00:00 MST"
## [5] "2005-01-01 01:15:00 MST" "2005-01-01 01:30:00 MST"

# what class is the output
class(new.date.time)

## [1] "POSIXct" "POSIXt"

About Time Zones

Above, we successfully converted our data into a date-time class. However, what time zone is the new.date.time object that we created using?

2005-04-15 03:30:00 MDT

It appears as if our data were assigned MDT (which is the timezone of the computer where these tutorials were processed originally - in Colorado - Mountain Time). However, we know from the metadata, explored in the Why Metadata Are Important: How to Work with Metadata in Text & EML Format tutorial, that these data were stored in Eastern Standard Time.

Assign Time Zone

When we convert a date-time formatted column to POSIXct format, we need to assign an associated time zone. If we don't assign a time zone, R will default to the local time zone that is defined on your computer. There are individual designations for different time zones and time zone variants (e.g., does the time occur during daylight saving time).

**Data Tip:** Codes for time zones can be found in this time zone table.

The code for the Eastern time zone that is the closest match to the NEON Harvard Forest field site is America/New_York. Let's convert our datetime field one more time, and define the associated timezone (tz=).

# assign time zone to just the first entry
as.POSIXct(harMet_15Min$datetime[1],
            format = "%Y-%m-%dT%H:%M",
            tz = "America/New_York")

## [1] "2005-01-01 00:15:00 EST"

The output above shows us that the time zone is now correctly set as EST.

Convert to Date-time Data Class

Now, using the syntax that we learned above, we can convert the entire datetime column to a POSIXct class.

# convert to POSIXct date-time class
harMet_15Min$datetime <- as.POSIXct(harMet_15Min$datetime,
                                format = "%Y-%m-%dT%H:%M",
                                tz = "America/New_York")

# view structure and time zone of the newly defined datetime column
str(harMet_15Min$datetime)

##  POSIXct[1:376800], format: "2005-01-01 00:15:00" "2005-01-01 00:30:00" ...

tz(harMet_15Min$datetime)

## [1] "America/New_York"

Now that our datetime data are properly identified as a POSIXct date-time data class we can continue on and look at the patterns of precipitation, temperature, and PAR through time.

Time Series 01: Why Metadata Are Important: How to Work with Metadata in Text & EML Format

This tutorial covers what metadata are, and why we need to work with metadata. It covers the 3 most common metadata formats: text file format, web page format and Ecological Metadata Language (EML).

R Skill Level: Introduction - you've got the basics of R down and understand the general structure of tabular data.

Learning Objectives

After completing this tutorial, you will be able to:

  • Import a .csv file and examine the structure of the related R object.
  • Use a metadata file to better understand the content of a dataset.
  • Explain the importance of including metadata details in your R script.
  • Describe what an EML file is.

Things You’ll Need To Complete This Tutorial

You will need the most current version of R and, preferably, RStudio loaded on your computer to complete this tutorial.

Install R Packages

When presented in a workshop, the EML package will be presented as a demonstration!

If completing the EML portion of the tutorial on your own, you must install EML directly from GitHub (the package is in development and is not yet on CRAN). You will need to install devtools to do this.

  • ggmap: install.packages("ggmap")
  • tmaptools: install.packages("tmaptools")
  • devtools: install.packages("devtools");library(devtools)
  • EML: must be installed AFTER devtools is loaded in R: install_github("ropensci/EML", build=FALSE, dependencies=c("DEPENDS", "IMPORTS"))

Download Data

NEON Teaching Data Subset: Meteorological Data for Harvard Forest

The data used in this lesson were collected at the National Ecological Observatory Network's Harvard Forest field site. These data are proxy data for what will be available for 30 years on the NEON data portal for the Harvard Forest and other field sites located across the United States.

Download Dataset


Set Working Directory: This lesson assumes that you have set your working directory to the location of the downloaded and unzipped data subsets.

An overview of setting the working directory in R can be found here.

R Script & Challenge Code: NEON data lessons often contain challenges that reinforce learned skills. If available, the code for challenge solutions is found in the downloadable R script of the entire lesson, available in the footer of each lesson page.

Understand Our Data

In order to work with any data, we need to understand three things about the data:

  • Structure
  • Format
  • Processing methods

If the data are collected by other people and organizations, we might also need further information about:

  • What metrics are included in the dataset
  • The units those metrics were stored in
  • Explanation of where the metrics are stored in the data and what they are "called" (e.g. what are the column names in a spreadsheet)
  • The time range that it covers
  • The spatial extent that the data covers

The above information, and more, is stored in metadata - data about the data. Metadata is information that describes a dataset and is integral to working with external data (data that we did not collect ourselves).

Metadata Formats

Metadata come in different formats. We will discuss three of those in this tutorial:

  • Ecological Metadata Language (EML): A standardized metadata format stored in XML, which is machine readable. Although EML is a standard, it is common to encounter metadata structured differently in EML files created by different organizations.
  • Text file: Sometimes metadata files can be found in text files that are either downloaded with a data product OR that are available separately for the data.
  • Directly on a website (HTML / XML): Sometimes data are documented directly in text format, on a web page.
**Data Tip:** When you find metadata for a dataset that you are working with, immediately **DOWNLOAD AND SAVE IT** to the directory on your computer where you saved the data. It is also a good idea to document the URL where you found the metadata and data in a "readme" text file!

Metadata Stored on a Web Page

The metadata for the data that we are working with for the Harvard Forest field site are stored in both EML format and on a web page. Let's explore the web page first.

  • View Harvard Forest Fisher Tower webpage.

Let's begin by visiting that page above. At the top of the page, there is a list of data available for Harvard Forest. NOTE: hf001-06: daily (metric) since 2001 (preview) is the data that we used in the previous tutorial.

Scroll down to the Overview section on the website. Take note of the information provided in that section and answer the questions in the Challenge below.

### Challenge: Explore Metadata

Explore the metadata stored on the Harvard Forest LTER web page. Answer the following questions.

  1. What is the time span of the data available for this site?
  2. You have some questions about these data. Who is the lead investigator / who do you contact for more information? And how do you contact them?
  3. Where is this field site located? How is the site location information stored in the metadata? Is there any information that may be useful to you viewing the location? HINT: What if you were not familiar with Harvard as a site / from another country, etc?
  4. Field Site Information: What is the elevation for the site? What is the dominant vegetation cover for the site? HINT: Is dominant vegetation easy to find in the metadata?
  5. How is snow dealt with in the precipitation data?
  6. Are there some metadata attributes that might be useful to access in a script in R or Python rather than viewed on a web page? HINT: Can you answer all of the questions above from the information provided on this website? Is there additional information that you might want to find on the page?

View Metadata For Metrics of Interest

For this tutorial series, we are interested in the drivers of plant phenology - specifically air and soil temperature, precipitation and photosynthetically active radiation (PAR). Let's look for descriptions of these variables in the metadata and determine several key attributes that we will need prior to working with the data.

### Challenge: Metrics of Interest Metadata

View the metadata at the URL above. Answer the following questions about the Harvard Forest LTER data - hf001-10: 15-minute (metric) since 2005:

  1. What is the column heading name where each variable (air temperature, soil temperature, precipitation and PAR) is stored?
  2. What are the units that each variable is stored in?
  3. What is the frequency of measurement collected for each and how are noData values stored?
  4. Where is the date information stored (in what field) and what timezone are the dates stored in?

Why Metadata on a Web Page Is Not Ideal

It is nice to have a web page that displays metadata information; however, accessing metadata on a web page has drawbacks:

  • If the web page URL changes or the site goes down, the information is lost.
  • It is also more challenging to access metadata in text format on a web page programmatically - for example, using R as an interface - which we often want to do when working with larger datasets.

A machine readable metadata file is better - especially when we are working with large data and we want to automate and carefully document workflows. The Ecological Metadata Language (EML) is one machine readable metadata format.

Ecological Metadata Language (EML)

While much of the documentation that we need to work with at the Harvard Forest field site is available directly on the Harvard Forest Data Archive page, the website also offers metadata in EML format.

Introduction to EML

The Ecological Metadata Language (EML) is a data specification developed specifically to document ecological data. An EML file is created using an XML-based format. This means that content is embedded within hierarchical tags. For example, the title of a dataset might be embedded in a <title> tag as follows:

<title>Fisher Meteorological Station at Harvard Forest since 2001</title>

Similarly, the creator of a dataset is also found in a hierarchical tag structure.

<creator>
  <individualName>
    <givenName>Emery</givenName>
    <surName>Boose</surName>
  </individualName>
</creator>

The EML package for R is designed to read and work with EML-formatted metadata. In this tutorial, we demonstrate how we can use EML in an automated workflow.

NOTE: The EML package is still being developed, therefore we will not explicitly teach all details of how to use it. Instead, we will provide an example of how you can access EML files programmatically and background information so that you can further explore EML and the EML package if you need to work with it further.

EML Terminology

Let's first discuss some basic EML terminology. In the context of EML, each EML file documents a dataset. This dataset may consist of one or more files that contain the data in data tables. In the case of our tabular meteorology data, the structure of our EML file includes:

  1. The dataset. A dataset may contain one or more data tables associated with it that may contain different types of related information. For this Harvard Forest meteorological data, the data tables contain tower measurements including precipitation and temperature, that are aggregated at various time intervals (15 minute, daily, etc) and that date back to 2001.
  2. The data tables. Data tables refer to the actual data that make up the dataset. For the Harvard Forest data, each data table contains a suite of meteorological metrics, including precipitation and temperature (and associated quality flags), that are aggregated at a particular time interval (e.g. one data table contains monthly average data, another contains 15 minute averaged data, etc).

Work With EML in R

To begin, we will load the EML package. If you have not installed it yet, you can install it from ROpenSci's GitHub repository (see the commented lines in the code below).

# install R EML tool 
# load devtools (if you need to install "EML")
#library("devtools")
# IF YOU HAVE NOT DONE SO ALREADY: install EML from github -- package in
# development; not on CRAN
#install_github("ropensci/EML", build=FALSE, dependencies=c("DEPENDS", "IMPORTS"))

# load ROpenSci EML package
library(EML)
# load ggmap for mapping
library(ggmap)
# load tmaptools for mapping
library(tmaptools)

Next, we will read in the LTER EML file - directly from the online URL using read_eml. This file documents multiple data products that can be downloaded. Check out the Harvard Forest Data Archive Page for Fisher Meteorological Station for more on this dataset and to download the archive files directly.

Note that because this EML file is large, it takes a few seconds for the file to load.

# data location
# http://harvardforest.fas.harvard.edu:8080/exist/apps/datasets/showData.html?id=hf001
# table 4 http://harvardforest.fas.harvard.edu/data/p00/hf001/hf001-04-monthly-m.csv

# import EML from Harvard Forest Met Data
# note: here we read the full EML file directly from the Harvard Forest website;
# a truncated version (commented out below) can be used for a quicker demonstration
eml_HARV <- read_eml("https://harvardforest1.fas.harvard.edu/sites/harvardforest.fas.harvard.edu/files/data/eml/hf001.xml")

# import a truncated version of the eml file for quicker demonstration
# eml_HARV <- read_eml("http://neonscience.github.io/NEON-R-Tabular-Time-Series/hf001-revised.xml")

# view size of object
object.size(eml_HARV)

## 1299568 bytes

# view the object class
class(eml_HARV)

## [1] "emld" "list"

The read_eml function creates an emld object. Its elements can be accessed with the $ operator, much like a nested list, rather than with a typical subset [] approach.
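For example, we can pull a few top-level elements with this list-style access (a minimal sketch; the exact elements available depend on the EML file):

# view the dataset title stored in the EML file
eml_HARV$dataset$title

# view the temporal coverage (date range) of the dataset
eml_HARV$dataset$coverage$temporalCoverage$rangeOfDates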

Explore Metadata Attributes

We can begin to explore the contents of our EML file and associated data that it describes. Let's start at the dataset level. We can use slots to view the contact information for the dataset and a description of the methods.

# view the contact name listed in the file

eml_HARV$dataset$creator

## $individualName
## $individualName$givenName
## [1] "Emery"
## 
## $individualName$surName
## [1] "Boose"

# view information about the methods used to collect the data as described in EML
eml_HARV$dataset$methods

## $methodStep
## $methodStep$description
## $methodStep$description$section
## $methodStep$description$section[[1]]
## [1] "<title>Observation periods</title><para>15-minute: 15 minutes, ending with given time. Hourly: 1 hour, ending with given time. Daily: 1 day, midnight to midnight. All times are Eastern Standard Time.</para>"
## 
## $methodStep$description$section[[2]]
## [1] "<title>Instruments</title><para>Air temperature and relative humidity: Vaisala HMP45C (2.2m above ground). Precipitation: Met One 385 heated rain gage (top of gage 1.6m above ground). Global solar radiation: Licor LI200X pyranometer (2.7m above ground). PAR radiation: Licor LI190SB quantum sensor (2.7m above ground). Net radiation: Kipp and Zonen NR-LITE net radiometer (5.0m above ground). Barometric pressure: Vaisala CS105 barometer. Wind speed and direction: R.M. Young 05103 wind monitor (10m above ground). Soil temperature: Campbell 107 temperature probe (10cm below ground). Data logger: Campbell Scientific CR10X.</para>"
## 
## $methodStep$description$section[[3]]
## [1] "<title>Instrument and flag notes</title><para>Air temperature. Daily air temperature is estimated from other stations as needed to complete record.</para><para>Precipitation. Daily precipitation is estimated from other stations as needed to complete record. Delayed melting of snow and ice (caused by problems with rain gage heater or heavy precipitation) is noted in log - daily values are corrected if necessary but 15-minute values are not.  The gage may underestimate actual precipitation under windy or cold conditions.</para><para>Radiation. Whenever possible, snow and ice are removed from radiation instruments after precipitation events.  Depth of snow or ice on instruments and time of removal are noted in log, but values are not corrected or flagged.</para><para>Wind speed and direction. During ice storms, values are flagged as questionable when there is evidence (from direct observation or the 15-minute record) that ice accumulation may have affected the instrument's operation.</para>"

Identify & Map Data Location

Looking at the coverage for our data, there is only one unique x and y value. This suggests that our data were collected at a single (x,y) point location. We know this is a tower, so a point location makes sense. Let's grab the x and y coordinates and create a quick context map. We will use ggmap to create our map.

NOTE: If this were a rectangular extent, we'd want the bounding box, not just the point. This is important if the data are in raster, HDF5, or a similar format. We need the extent to properly geolocate and process the data.

# grab x coordinate from the coverage information
XCoord <- eml_HARV$dataset$coverage$geographicCoverage$boundingCoordinates$westBoundingCoordinate[[1]]
# grab y coordinate from the coverage information
YCoord <- eml_HARV$dataset$coverage$geographicCoverage$boundingCoordinates$northBoundingCoordinate[[1]]
# make a map and add the xy coordinates to it
ggmap(get_stamenmap(rbind(as.numeric(paste(geocode_OSM("Massachusetts")$bbox))),
                    zoom = 11, maptype = c("toner")), extent = TRUE) +
  geom_point(aes(x = as.numeric(XCoord), y = as.numeric(YCoord)),
             color = "darkred", size = 6, pch = 18)

Map of Massachusetts, U.S.A., showing location of Harvard Forest

  • Learn more about ggmap: A nice cheatsheet created by NCEAS

The above example demonstrated how we can extract information from an EML document and use it programmatically in R! This is just the beginning of what we can do!

Metadata For Your Own Data

Now, imagine that you are working with someone else's data and you don't have a metadata file associated with it. How do you know what units the data were in? How were the data collected? What location do the data cover? It is equally important to create metadata for your own data, to make your data easier to share.
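As a starting point, the same EML package can also be used to build a small metadata record of your own. The sketch below uses hypothetical title and creator values and writes them out with write_eml(); a complete EML record requires more elements than shown here.

# a minimal, hypothetical EML record for a dataset of your own
my_eml <- list(
  dataset = list(
    title = "Example meteorological data from my field site",
    creator = list(
      individualName = list(givenName = "Ada", surName = "Lovelace"))
  )
)

# write the metadata to an EML (xml) file in the working directory
write_eml(my_eml, "my-metadata.xml")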

Time Series 00: Intro to Time Series Data in R - Managing Date/Time Formats

This tutorial will demonstrate how to import a time series dataset stored in .csv format into R. It will explore data classes for columns in a data.frame and will walk through how to convert a date, stored as a character string, into a date class that R can recognize and plot efficiently.

Learning Objectives

After completing this tutorial, you will be able to:

  • Open a .csv file in R using read.csv() and understand why we are using that file type.
  • Work with data stored in different columns within a data.frame in R.
  • Examine R object structures and data classes.
  • Convert dates, stored as a character class, into an R date class.
  • Create a quick plot of a time-series dataset using qplot.

Things You’ll Need To Complete This Tutorial

You will need the most current version of R and, preferably, RStudio loaded on your computer to complete this tutorial.

Install R Packages

  • ggplot2: install.packages("ggplot2")

More on Packages in R – Adapted from Software Carpentry.

Download Data

NEON Teaching Data Subset: Meteorological Data for Harvard Forest

The data used in this lesson were collected at the National Ecological Observatory Network's Harvard Forest field site. These data are proxy data for what will be available for 30 years on the NEON data portal for the Harvard Forest and other field sites located across the United States.

Download Dataset


Set Working Directory: This lesson assumes that you have set your working directory to the location of the downloaded and unzipped data subsets.

An overview of setting the working directory in R can be found here.

R Script & Challenge Code: NEON data lessons often contain challenges that reinforce learned skills. If available, the code for challenge solutions is found in the downloadable R script of the entire lesson, available in the footer of each lesson page.

Data Related to Phenology

In this tutorial, we will explore atmospheric data (including temperature, precipitation and other metrics) collected by sensors mounted on a flux tower at the NEON Harvard Forest field site. We are interested in exploring changes in temperature, precipitation, Photosynthetically Active Radiation (PAR) and day length throughout the year -- metrics that impact changes in the timing of plant phenophases (phenology).

About .csv Format

The data that we will use is in .csv (comma-separated values) file format. The .csv format is a plain text format, where each value in the dataset is separated by a comma and each "row" in the dataset is separated by a line break. Plain text formats are ideal for working across platforms (Mac, PC, Linux, etc.) and can be read by many different tools. The plain text format is also less likely to become obsolete over time.

**Data Tip:** For more on .csv format see this Wikipedia article.

Import the Data

To begin, let's import the data into R. We can use base R functionality to import a .csv file. We will use the ggplot2 package to plot our data.

# Load packages required for entire script. 
# library(PackageName)  # purpose of package
library(ggplot2)   # efficient, pretty plotting - required for qplot function

# set working directory to ensure R can find the file we wish to import
# provide the location for where you've unzipped the lesson data
wd <- "~/Git/data/"
**Data Tip:** Good coding practice -- install and load all libraries at top of script. If you decide you need another package later on in the script, return to this area and add it. That way, with a glance, you can see all packages used in a given script.

Once our working directory is set, we can import the file using read.csv().

# Load csv file of daily meteorological data from Harvard Forest
harMet.daily <- read.csv(
      file=paste0(wd,"NEON-DS-Met-Time-Series/HARV/FisherTower-Met/hf001-06-daily-m.csv"),
      stringsAsFactors = FALSE
      )

stringsAsFactors=FALSE

When reading in files, we most often use stringsAsFactors = FALSE. This setting ensures that non-numeric data (strings) are not converted to factors. (Since R 4.0.0 this is the default behavior, but setting it explicitly makes the intent clear and keeps scripts working in older versions of R.)

What Is A Factor?

A factor is similar to a category. However, factors can be numerically interpreted (they can have an order) and have a set of levels associated with them.

Examples of factors:

  • Month Names (an ordinal variable): Month names are non-numerical but we know that April (month 4) comes after March (month 3) and each could be represented by a number (4 & 3).
  • 1 and 2s to represent male and female sex (a nominal variable): Numerical interpretation of non-numerical data but no order to the levels.
**Data Tip:** Read more about factors here.

After loading the data, it is easy to convert any field that should be a factor using as.factor(). Therefore, it is often best to read in a file with stringsAsFactors = FALSE.
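For example, a character column or vector can be converted after import (a minimal sketch using a hypothetical vector of month names):

# a hypothetical character vector of month names
months <- c("March", "April", "March", "May")

# convert it to a factor after import
months <- as.factor(months)
class(months)
levels(months)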

Data.Frames in R

The read.csv() imports our .csv into a data.frame object in R. data.frames are ideal for working with tabular data - they are similar to a spreadsheet.

# what type of R object is our imported data?
class(harMet.daily)

## [1] "data.frame"

Data Structure

Once the data are imported, we can explore their structure. There are several ways to examine the structure of a data frame:

  • head(): shows us the first 6 rows of the data (tail() shows the last 6 rows).
  • str() : displays the structure of the data as R interprets it.

Let's use both to explore our data.

# view first 6 rows of the dataframe 
head(harMet.daily)

##         date jd  airt f.airt airtmax f.airtmax airtmin f.airtmin rh
## 1 2001-02-11 42 -10.7           -6.9             -15.1           40
## 2 2001-02-12 43  -9.8           -2.4             -17.4           45
## 3 2001-02-13 44  -2.0            5.7              -7.3           70
## 4 2001-02-14 45  -0.5            1.9              -5.7           78
## 5 2001-02-15 46  -0.4            2.4              -5.7           69
## 6 2001-02-16 47  -3.0            1.3              -9.0           82
##   f.rh rhmax f.rhmax rhmin f.rhmin  dewp f.dewp dewpmax f.dewpmax
## 1         58            22         -22.2          -16.8          
## 2         85            14         -20.7           -9.2          
## 3        100            34          -7.6           -4.6          
## 4        100            59          -4.1            1.9          
## 5        100            37          -6.0            2.0          
## 6        100            46          -5.9           -0.4          
##   dewpmin f.dewpmin prec f.prec slrt f.slrt part f.part netr f.netr
## 1   -25.7            0.0        14.9          NA      M   NA      M
## 2   -27.9            0.0        14.8          NA      M   NA      M
## 3   -10.2            0.0        14.8          NA      M   NA      M
## 4   -10.2            6.9         2.6          NA      M   NA      M
## 5   -12.1            0.0        10.5          NA      M   NA      M
## 6   -10.6            2.3         6.4          NA      M   NA      M
##    bar f.bar wspd f.wspd wres f.wres wdir f.wdir wdev f.wdev gspd
## 1 1025        3.3         2.9         287          27        15.4
## 2 1033        1.7         0.9         245          55         7.2
## 3 1024        1.7         0.9         278          53         9.6
## 4 1016        2.5         1.9         197          38        11.2
## 5 1010        1.6         1.2         300          40        12.7
## 6 1016        1.1         0.5         182          56         5.8
##   f.gspd s10t f.s10t s10tmax f.s10tmax s10tmin f.s10tmin
## 1          NA      M      NA         M      NA         M
## 2          NA      M      NA         M      NA         M
## 3          NA      M      NA         M      NA         M
## 4          NA      M      NA         M      NA         M
## 5          NA      M      NA         M      NA         M
## 6          NA      M      NA         M      NA         M

# View the structure (str) of the data 
str(harMet.daily)

## 'data.frame':	5345 obs. of  46 variables:
##  $ date     : chr  "2001-02-11" "2001-02-12" "2001-02-13" "2001-02-14" ...
##  $ jd       : int  42 43 44 45 46 47 48 49 50 51 ...
##  $ airt     : num  -10.7 -9.8 -2 -0.5 -0.4 -3 -4.5 -9.9 -4.5 3.2 ...
##  $ f.airt   : chr  "" "" "" "" ...
##  $ airtmax  : num  -6.9 -2.4 5.7 1.9 2.4 1.3 -0.7 -3.3 0.7 8.9 ...
##  $ f.airtmax: chr  "" "" "" "" ...
##  $ airtmin  : num  -15.1 -17.4 -7.3 -5.7 -5.7 -9 -12.7 -17.1 -11.7 -1.3 ...
##  $ f.airtmin: chr  "" "" "" "" ...
##  $ rh       : int  40 45 70 78 69 82 66 51 57 62 ...
##  $ f.rh     : chr  "" "" "" "" ...
##  $ rhmax    : int  58 85 100 100 100 100 100 71 81 78 ...
##  $ f.rhmax  : chr  "" "" "" "" ...
##  $ rhmin    : int  22 14 34 59 37 46 30 34 37 42 ...
##  $ f.rhmin  : chr  "" "" "" "" ...
##  $ dewp     : num  -22.2 -20.7 -7.6 -4.1 -6 -5.9 -10.8 -18.5 -12 -3.5 ...
##  $ f.dewp   : chr  "" "" "" "" ...
##  $ dewpmax  : num  -16.8 -9.2 -4.6 1.9 2 -0.4 -0.7 -14.4 -4 0.6 ...
##  $ f.dewpmax: chr  "" "" "" "" ...
##  $ dewpmin  : num  -25.7 -27.9 -10.2 -10.2 -12.1 -10.6 -25.4 -25 -16.5 -5.7 ...
##  $ f.dewpmin: chr  "" "" "" "" ...
##  $ prec     : num  0 0 0 6.9 0 2.3 0 0 0 0 ...
##  $ f.prec   : chr  "" "" "" "" ...
##  $ slrt     : num  14.9 14.8 14.8 2.6 10.5 6.4 10.3 15.5 15 7.7 ...
##  $ f.slrt   : chr  "" "" "" "" ...
##  $ part     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ f.part   : chr  "M" "M" "M" "M" ...
##  $ netr     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ f.netr   : chr  "M" "M" "M" "M" ...
##  $ bar      : int  1025 1033 1024 1016 1010 1016 1008 1022 1022 1017 ...
##  $ f.bar    : chr  "" "" "" "" ...
##  $ wspd     : num  3.3 1.7 1.7 2.5 1.6 1.1 3.3 2 2.5 2 ...
##  $ f.wspd   : chr  "" "" "" "" ...
##  $ wres     : num  2.9 0.9 0.9 1.9 1.2 0.5 3 1.9 2.1 1.8 ...
##  $ f.wres   : chr  "" "" "" "" ...
##  $ wdir     : int  287 245 278 197 300 182 281 272 217 218 ...
##  $ f.wdir   : chr  "" "" "" "" ...
##  $ wdev     : int  27 55 53 38 40 56 24 24 31 27 ...
##  $ f.wdev   : chr  "" "" "" "" ...
##  $ gspd     : num  15.4 7.2 9.6 11.2 12.7 5.8 16.9 10.3 11.1 10.9 ...
##  $ f.gspd   : chr  "" "" "" "" ...
##  $ s10t     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ f.s10t   : chr  "M" "M" "M" "M" ...
##  $ s10tmax  : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ f.s10tmax: chr  "M" "M" "M" "M" ...
##  $ s10tmin  : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ f.s10tmin: chr  "M" "M" "M" "M" ...
**Data Tip:** You can adjust the number of rows returned when using the `head()` and `tail()` functions. For example you can use `head(harMet.daily, 10)` to display the first 10 rows of your data rather than 6.

Classes in R

The structure results above let us know that the attributes in our data.frame are stored as several different data types or classes as follows:

  • chr - Character: It holds strings that are composed of letters and words. Character class data cannot be interpreted numerically - that is to say, we cannot perform math on these values even if they contain only numbers.
  • int - Integer: It holds numbers that are whole integers without decimals. Mathematical operations can be performed on integers.
  • num - Numeric: It accepts data in a wide variety of numeric formats, including decimals (floating point values) and integers. Numeric also accepts larger numbers than int will.

Storing variables using different classes is a strategic decision by R (and other programming languages) that optimizes processing and storage. It allows:

  • data to be processed more quickly & efficiently.
  • the program (R) to minimize the storage size.

Differences Between Classes

Certain functions can be performed on certain data classes and not on others.

For example:

a <- "mouse"
b <- "sparrow"
class(a)

## [1] "character"

class(b)

## [1] "character"

# subtract a-b 
a-b

## Error in a - b: non-numeric argument to binary operator

You can not subtract two character values given they are not numbers.

c <- 2
d <- 1
class(c)

## [1] "numeric"

class(d)

## [1] "numeric"

# subtract a-b 
c-d

## [1] 1

Additionally, performing summary statistics and other calculations of different types of classes can yield different results.

# create a new object
speciesObserved <- c("speciesb","speciesc","speciesa")
speciesObserved

## [1] "speciesb" "speciesc" "speciesa"

# determine the class
class(speciesObserved)

## [1] "character"

# calculate the minimum
min(speciesObserved)

## [1] "speciesa"

# create numeric object
prec <- c(1,2,5,3,6)
# view class
class(prec)

## [1] "numeric"

# calculate min value
min(prec)

## [1] 1

We can calculate the minimum value for speciesObserved, a character data class; however, it does not return a quantitative minimum. It simply returns the first element in alphabetical (rather than numeric) order. In contrast, we can calculate a true quantitative minimum value for prec, a numeric data class.

Plot Data Using qplot()

Now that we've got classes down, let's plot one of the metrics in our data, air temperature -- airt. Given this is a time series dataset, we want to plot air temperature as it changes over time. We have a date-time column, date, so let's use that as our x-axis variable and airt as our y-axis variable.

We will use the qplot() (for quick plot) function in the ggplot2 package. The syntax for qplot() requires the x- and y-axis variables and then the R object that the variables are stored in.

**Data Tip:** Add a title to the plot using `main="Title string"`.
# quickly plot air temperature
qplot(x=date, y=airt, 
      data=harMet.daily,
      main="Daily Air Temperature\nNEON Harvard Forest Field Site")

Relationship Between Daily Air Temperature and Time at Harvard Forest Research Site

We have successfully plotted some data. However, what is happening on the x-axis?

R is trying to plot EVERY date value in our data, on the x-axis. This makes it hard to read. Why? Let's have a look at the class of the x-axis variable - date.

# View data class for each column that we wish to plot
class(harMet.daily$date)

## [1] "character"

class(harMet.daily$airt)

## [1] "numeric"

In this case, the date column is stored in our data.frame as a character class. Because it is a character, R does not know how to plot the dates as a continuous variable. Instead it tries to plot every date value as a text string. The airt data class is numeric so that metric plots just fine.

Date as a Date-Time Class

We need to convert our date column, which is currently stored as a character, to a date-time class that can be displayed as a continuous variable. Lucky for us, R has a Date class. We can convert the date field to a date class using as.Date().

# convert column to date class
harMet.daily$date <- as.Date(harMet.daily$date)

# view R class of data
class(harMet.daily$date)

## [1] "Date"

# view results
head(harMet.daily$date)

## [1] "2001-02-11" "2001-02-12" "2001-02-13" "2001-02-14" "2001-02-15"
## [6] "2001-02-16"

Now that we have adjusted the date, let's plot again. Notice that it plots much more quickly now that R recognizes date as a date class. R can aggregate ticks on the x-axis by year instead of trying to plot every day!

# quickly plot the data and include a title using main=""
# In title string we can use '\n' to force the string to break onto a new line
qplot(x=date,y=airt, 
      data=harMet.daily,
      main="Daily Air Temperature w/ Date Assigned\nNEON Harvard Forest Field Site")  

Relationship Between Daily Air Temperature and Time at Harvard Forest Research Site

### Challenge: Using ggplot2's qplot function
  1. Create a quick plot of the precipitation. Use the full time frame of data available in the harMet.daily object.
  2. Do precipitation and air temperature have similar annual patterns?
  3. Create a quick plot examining the relationship between air temperature and precipitation.

Hint: you can modify the X and Y axis labels using xlab="label text" and ylab="label text".
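One possible approach to challenge items 1 and 3 is sketched below; it uses the prec column for daily precipitation, and the axis labels assume the metric units described in the metadata:

# challenge item 1: quick plot of daily precipitation over the full time frame
qplot(x=date, y=prec,
      data=harMet.daily,
      main="Daily Precipitation\nNEON Harvard Forest Field Site",
      xlab="Date", ylab="Precipitation (mm)")

# challenge item 3: relationship between air temperature and precipitation
qplot(x=airt, y=prec,
      data=harMet.daily,
      main="Daily Precipitation vs. Air Temperature\nNEON Harvard Forest Field Site",
      xlab="Air Temperature (C)", ylab="Precipitation (mm)")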

Relationship Between Daily Precipitation and Time at Harvard Forest Research SiteRelationship Between Daily Precipitation and Daily Air Temperature at Harvard Forest Research Site

Convert to Julian Day

This tutorial defines Julian (year) day as most often used in an ecological context, explains why Julian days are useful for analysis and plotting, and teaches how to create a Julian day variable from a Date or Data/Time class variable.

Learning Objectives

After completing this tutorial, you will be able to:

  • Define a Julian day (year day) as used in most ecological contexts.
  • Convert a Date or Date/Time class variable to a Julian day variable.

Things You’ll Need To Complete This Tutorial

You will need the most current version of R and, preferably, RStudio loaded on your computer to complete this tutorial.

Install R Packages

  • lubridate: install.packages("lubridate")

More on Packages in R – Adapted from Software Carpentry.

Download Data

NEON Teaching Data Subset: Meteorological Data for Harvard Forest

The data used in this lesson were collected at the National Ecological Observatory Network's Harvard Forest field site. These data are proxy data for what will be available for 30 years on the NEON data portal for the Harvard Forest and other field sites located across the United States.

Download Dataset


Set Working Directory: This lesson assumes that you have set your working directory to the location of the downloaded and unzipped data subsets.

An overview of setting the working directory in R can be found here.

R Script & Challenge Code: NEON data lessons often contain challenges that reinforce learned skills. If available, the code for challenge solutions is found in the downloadable R script of the entire lesson, available in the footer of each lesson page.

Convert Between Time Formats - Julian Days

Julian day, as most often used in an ecological context, is a continuous count of the number of days beginning at January 1 each year. Thus each year will have up to 365 (non-leap year) or 366 (leap year) days.

**Data Note:** This format can also be called ordinal day or year day. In some contexts, Julian day can refer specifically to a numeric day count since 1 January 4713 BCE or as a count from some other origin day, instead of an annual count of 365 or 366 days.

Including a Julian day variable in your dataset can be very useful when comparing data across years, when plotting data, and when matching your data with other types of data that include Julian day.

Load the Data

Load this dataset that we will use to convert a date into a year day or Julian day.

Notice the date is read in as a character and must first be converted to a Date class.

# Load packages required for entire script
library(lubridate)  #work with dates

# set working directory to ensure R can find the file we wish to import
wd <- "~/Git/data/"

# Load csv file of daily meteorological data from Harvard Forest
# stringsAsFactors=FALSE so strings (series of letters/words/numerals) remain characters
harMet_DailyNoJD <- read.csv(
  file=paste0(wd,"NEON-DS-Met-Time-Series/HARV/FisherTower-Met/hf001-06-daily-m-NoJD.csv"),
  stringsAsFactors = FALSE
  )

# what data class is the date column? 
str(harMet_DailyNoJD$date)

##  chr [1:5345] "2/11/01" "2/12/01" "2/13/01" "2/14/01" "2/15/01" ...

# convert "date" from chr to a Date class and specify current date format
harMet_DailyNoJD$date<- as.Date(harMet_DailyNoJD$date, "%m/%d/%y")

Convert with yday()

To quickly convert from Date to Julian days, we can use yday(), a function from the lubridate package.

# to learn more about it type
?yday

We want to create a new column in the existing data frame, titled julian, that contains the Julian day data.

# convert with yday into a new column "julian"
harMet_DailyNoJD$julian <- yday(harMet_DailyNoJD$date)  

# make sure it worked all the way through. 
head(harMet_DailyNoJD$julian) 

## [1] 42 43 44 45 46 47

tail(harMet_DailyNoJD$julian)

## [1] 268 269 270 271 272 273
**Data Tip:** In this tutorial we converted from `Date` class to a Julian day, however, you can convert between any recognized date/time class (POSIXct, POSIXlt, etc) and Julian day using `yday`.
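For instance, a date-time value stored in the POSIXct class converts the same way (a minimal sketch):

# a single date-time value stored as POSIXct
datetime_example <- as.POSIXct("2001-02-11 14:30", tz = "EST")

# yday() returns the year day regardless of the date/time class
yday(datetime_example)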

Raster 01: Plot Raster Data in R

This tutorial reviews how to plot a raster in R using the plot() function. It also covers how to layer a raster on top of a hillshade to produce an elegant map.

Learning Objectives

After completing this tutorial, you will be able to:

  • Know how to plot a single band raster in R.
  • Know how to layer a raster dataset on top of a hillshade to create an elegant basemap.

Things You’ll Need To Complete This Tutorial

You will need the most current version of R and, preferably, RStudio loaded on your computer to complete this tutorial.

Install R Packages

  • terra: install.packages("terra")

  • More on Packages in R – Adapted from Software Carpentry.

Download Data

Data required for this tutorial will be downloaded using neonUtilities in the lesson.

The LiDAR and imagery data used in this lesson were collected over the National Ecological Observatory Network's Harvard Forest (HARV) field site.

The entire dataset can be accessed from the NEON Data Portal.


Set Working Directory: This lesson will explain how to set the working directory. You may wish to set your working directory to some other location, depending on how you prefer to organize your data.

An overview of setting the working directory in R can be found here.


Additional Resources

  • Read more about the terra package in R.

Plot Raster Data in R

In this tutorial, we will plot the Digital Surface Model (DSM) raster for the NEON Harvard Forest Field Site. We will use the hist() function as a tool to explore raster values, and render classified plots using the breaks argument to get bins that are meaningful representations of our data.

We will use the terra package in this tutorial. If you do not have the DSM_HARV variable as defined in the previous tutorial, Intro To Raster In R, please download it using neonUtilities, as shown in the previous tutorial.

library(terra)



# set working directory

wd <- "~/data/"

setwd(wd)



# import raster into R

dsm_harv_file <- paste0(wd, "DP3.30024.001/neon-aop-products/2022/FullSite/D01/2022_HARV_7/L3/DiscreteLidar/DSMGtif/NEON_D01_HARV_DP3_732000_4713000_DSM.tif")

DSM_HARV <- rast(dsm_harv_file)

First, let's plot our Digital Surface Model object (DSM_HARV) using the plot() function. We add a title using the argument main="title".

# Plot raster object

plot(DSM_HARV, main="Digital Surface Model - HARV")

Digital surface model showing the continuous elevation of NEON's site Harvard Forest

Plotting Data Using Breaks

We can view our data "symbolized" or colored according to ranges of values rather than using a continuous color ramp. This is comparable to a "classified" map. However, to assign breaks, it is useful to first explore the distribution of the data using a histogram. The breaks argument in the hist() function tells R to use fewer or more breaks or bins.

If we name the histogram, we can also view counts for each bin and assigned break values.

# Plot distribution of raster values 

DSMhist<-hist(DSM_HARV,
     breaks=3,
     main="Histogram Digital Surface Model\n NEON Harvard Forest Field Site",
     col="lightblue",  # changes bin color
     xlab= "Elevation (m)")  # label the x-axis

Histogram of digital surface model showing the distribution of the elevation of NEON's site Harvard Forest

# Where are breaks and how many pixels in each category?

DSMhist$breaks

## [1] 300 350 400 450

DSMhist$counts

## [1] 355269 611685  33046

Looking at our histogram, R has binned out the data as follows:

  • 300-350m, 350-400m, 400-450m

We can determine that most of the pixel values fall in the 350-400m range with a few pixels falling in the lower and higher range. We could specify different breaks, if we wished to have a different distribution of pixels in each bin.

We can use those bins to plot our raster data. We will use the terrain.colors() function to create a palette of 3 colors to use in our plot.

The breaks argument allows us to add breaks. To specify where the breaks occur, we use the following syntax: breaks=c(value1,value2,value3). We can include as few or many breaks as we'd like.

# plot using breaks.

plot(DSM_HARV, 
     breaks = c(300, 350, 400, 450), 
     col = terrain.colors(3),
     main="Digital Surface Model (DSM) - HARV")

Digital surface model showing the elevation of NEON's site Harvard Forest with three breaks

**Data Tip:** Note that when we assign break values, a set of 4 values will result in 3 bins of data.

Format Plot

If we need to create multiple plots using the same color palette, we can create an R object (myCol) for the set of colors that we want to use. We can then quickly change the palette across all plots by simply modifying the myCol object.

We can label the x- and y-axes of our plot too using xlab and ylab.

# Assign color to a object for repeat use/ ease of changing

myCol = terrain.colors(3)



# Add axis labels

plot(DSM_HARV, 
     breaks = c(300, 350, 400, 450), 
     col = myCol,
     main="Digital Surface Model - HARV", 
     xlab = "UTM Easting (m)", 
     ylab = "UTM Northing (m)")

Digital surface model showing the elevation of NEON's site Harvard Forest with UTM Easting Coordinate (m) on the x-axis and UTM Northing Coordinate (m) on the y-axis

Or we can also turn off the axes altogether.

# or we can turn off the axis altogether

plot(DSM_HARV, 
     breaks = c(300, 350, 400, 450), 
     col = myCol,
     main="Digital Surface Model - HARV", 
     axes=FALSE)

Digital surface model showing the elevation of NEON's site Harvard Forest with no axes

Challenge: Plot Using Custom Breaks

Create a plot of the Harvard Forest Digital Surface Model (DSM) that has:

  • Six classified ranges of values (break points) that are evenly divided among the range of pixel values.
  • Axis labels
  • A plot title
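A minimal sketch of one possible solution is shown below; it assumes the raster's minimum and maximum values are available via terra's minmax() and uses them to compute six evenly spaced bins:

# compute 7 evenly spaced break values (which yield 6 bins) from the DSM's value range
dsm_range <- minmax(DSM_HARV)
even_breaks <- seq(dsm_range[1], dsm_range[2], length.out = 7)

# plot using the custom breaks, axis labels, and a title
plot(DSM_HARV,
     breaks = even_breaks,
     col = terrain.colors(6),
     main = "Digital Surface Model - HARV",
     xlab = "UTM Easting (m)",
     ylab = "UTM Northing (m)")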

Hillshade & Layering Rasters

The terra package has built-in functions called terrain for calculating slope and aspect, and shade for computing hillshade from the slope and aspect. A hillshade is a raster that maps the shadows and texture that you would see from above when viewing terrain.

See the shade documentation for more details.

We can layer a raster on top of a hillshade raster for the same area, and use a transparency factor to create a 3-dimensional shaded effect.

slope <- terrain(DSM_HARV, "slope", unit="radians")

aspect <- terrain(DSM_HARV, "aspect", unit="radians")

hill <- shade(slope, aspect, 45, 270)

plot(hill, col=grey(0:100/100), legend=FALSE, mar=c(2,2,1,4))

plot(DSM_HARV, col=terrain.colors(25, alpha=0.35), add=TRUE, main="HARV DSM with Hillshade")

The alpha value determines how transparent the colors will be (0 being transparent, 1 being opaque). You can also change the color palette, for example, use rainbow() instead of terrain.colors().

  • More information can be found in the R color palettes documentation.
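For example, the same layered plot could be re-drawn with a different palette and transparency (a minimal sketch reusing the hill and DSM_HARV objects created above):

# re-plot the hillshade, then overlay the DSM using the rainbow palette at 50% opacity
plot(hill, col=grey(0:100/100), legend=FALSE, mar=c(2,2,1,4))

plot(DSM_HARV, col=rainbow(25, alpha=0.5), add=TRUE)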

Challenge: Create DTM & DSM for SJER

Download data from the NEON field site San Joaquin Experimental Range (SJER) and create maps of the Digital Terrain and Digital Surface Models.

Make sure to:

  • include hillshade in the maps,
  • label axes on the DSM map and exclude them from the DTM map,
  • add titles for the maps,
  • experiment with various alpha values and color palettes to represent the data.
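One way to start, sketched below, is to download a SJER elevation tile with neonUtilities::byTileAOP() and read it in with terra::rast(); the product ID matches the DSM/DTM elevation product used above, while the year and the easting/northing values are assumptions you would adjust for the tile(s) you want:

# download one SJER elevation (DSM/DTM) tile -- adjust year and coordinates as needed
library(neonUtilities)

byTileAOP(dpID = "DP3.30024.001",
          site = "SJER",
          year = "2021",
          easting = 257500,   # assumed easting within SJER
          northing = 4112500, # assumed northing within SJER
          savepath = wd)

# then read the downloaded DSM and DTM GeoTIFFs with terra::rast() and repeat the
# hillshade + plot() workflow shown above for each model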

Going On The Grid -- An Intro to Gridding & Spatial Interpolation

This tutorial was originally created for an ESA brown-bag workshop. Here we present the main graphics and topics covered in the workshop.

Additional Resources

  • Learn more about LiDAR data in our tutorial The Basics of LiDAR - Light Detection and Ranging - Remote Sensing.
  • Learn more about Digital Surface Models, Digital Terrain Models and Canopy Height Models in our tutorial What is a CHM, DSM and DTM? About Gridded, Raster LiDAR Data.

How Does LiDAR Work?

About Rasters

For a full tutorial on rasters & raster data, please go through the Intro to Raster Data in R tutorial, which provides foundational concepts even if you aren't working with R.

A raster is a dataset made up of cells or pixels. Each pixel represents a value associated with a region on the earth’s surface.

The spatial resolution of a raster refers to the size of each cell in meters. This size in turn relates to the area on the ground that the pixel represents. Source: National Ecological Observatory Network

For more on raster resolution, see our tutorial on The Relationship Between Raster Resolution, Spatial Extent & Number of Pixels.

Creating Surfaces (Rasters) From Points

There are several ways that we can get from the point data collected by lidar to the surface data that we want for different Digital Elevation Models or similar data we use in analyses and mapping.

Basic gridding does not allow for the recovery/classification of data in any area where data are missing. Interpolation (including Triangulated Irregular Network (TIN) Interpolation) allows for gaps to be covered so that there are not holes in the resulting raster surface.

Interpolation can be done in a number of different ways, some of which are deterministic and some are probabilistic.

When converting a set of sample points to a grid, there are many different approaches that should be considered. Source: National Ecological Observatory Network

Gridding Points

When creating a raster, you may choose to perform a direct gridding of the data. This means that you calculate one value for every cell in the raster where there are sample points. This value may be a mean of all points, a max, min, or some other mathematical function. All other cells will then have no data values associated with them. This means you may have gaps in your data if the point spacing does not ensure at least one data point within the spatial coverage of each raster cell.

When you directly grid a dataset, values will only be calculated for cells that overlap with data points. Thus, data gaps will not be filled. Source: National Ecological Observatory Network

We can create a raster from points through a process called gridding. Gridding is the process of taking a set of points and using them to create a surface composed of a regular grid.
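To make this concrete, here is a minimal sketch of direct gridding in R using the terra package with a small set of hypothetical point elevations; each cell receives the mean of the points that fall within it, and cells containing no points remain NA:

library(terra)

# hypothetical point data: x/y locations with an elevation value (z)
pts <- vect(data.frame(x = runif(100, 0, 10),
                       y = runif(100, 0, 10),
                       z = rnorm(100, mean = 350, sd = 20)),
            geom = c("x", "y"))

# an empty raster (grid) covering the same extent, with 1-unit cells
grid <- rast(ext(pts), resolution = 1)

# direct gridding: mean of the points falling in each cell; empty cells stay NA
gridded <- rasterize(pts, grid, field = "z", fun = mean)

plot(gridded, main = "Directly gridded point elevations")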

Animation showing the general process of taking lidar point clouds and converting them to a raster format. Source: Tristan Goulden, National Ecological Observatory Network

Spatial Interpolation

Spatial interpolation involves calculating the value for a query point (or a raster cell) with an unknown value from a set of known sample point values that are distributed across an area. There is a general assumption that points closer to the query point are more strongly related to that cell than those farther away. However this general assumption is applied differently across different interpolation functions.

Interpolation methods will estimate values for cells where no known values exist.

Deterministic & Probabilistic Interpolators

There are two main types of interpolation approaches:

  • Deterministic: create surfaces directly from measured points using a weighted distance or area function.
  • Probabilistic (Geostatistical): utilize the statistical properties of the measured points. Probabilistic techniques quantify the spatial auto-correlation among measured points and account for the spatial configuration of the sample points around the prediction location.

Different methods of interpolation are better for different datasets. This table lays out the strengths of some of the more common interpolation methods.

Abbreviations: TIN=Triangulated Irregular Network, IDW=Inverse Distance Weighted interpolation. Source: National Ecological Observatory Network

We will focus on deterministic methods in this tutorial.

Deterministic Interpolation Methods

Let's look at a few different deterministic interpolation methods to understand how different methods can affect an output raster.

Inverse Distance Weighted (IDW)

Inverse distance weighted (IDW) interpolation calculates the values of a query point (a cell with an unknown value) using a linearly weighted combination of values from nearby points.

IDW interpolation calculates the value of an unknown cell center value (a query point) using surrounding points with the assumption that closest points will be the most similar to the value of interest. Source: QGIS

Key Attributes of IDW Interpolation

  • The raster is derived based upon an assumed linear relationship between the location of interest and the distance to surrounding sample points. In other words, sample points closest to the cell of interest are assumed to be more related to its value than those further away.
  • Bounded/exact estimation: IDW cannot interpolate beyond the min/max range of the data point values. Estimating only within the range of existing sample point values can yield "flattened" peaks and valleys -- especially if the data didn't capture those high and low points.
  • Interpolated points are average values.
  • Good for point data that are equally distributed and dense. Assumes a consistent trend or relationship between points and does not accommodate trends within the data (e.g. east to west, elevational, etc).
IDW interpolation looks at the linear distance between the unknown value and surrounding points. Source: J. Albrecht, CUNY

Power

The power value changes the "weighting" of the IDW interpolation by specifying how strongly points further away from the query point impact the calculated value for that point. Power values range from 0-3+ with the default setting generally being 2. A larger power value produces a more localized result - values further away from the cell have less impact on its calculated value, while values closer to the cell impact its value more. A smaller power value produces a more averaged result where sample points further away from the cell have a greater impact on the cell's calculated value.

For visualizations of IDW interpolation, see Jochen Albrecht's Inverse Distance Weighted 3D Concepts Lecture.

The impacts of power:

  • lower power values: a more averaged result, with the potential for a smoother surface. As power decreases, the influence of more distant sample points increases. This yields a smoother surface that is more averaged.

  • greater power values: more localized result, potential for more peaks and valleys around sample point locations. As power increases, the influence of sample points falls off more rapidly with distance. The query cell values become more localized and less averaged.

IDW Take Home Points

IDW is good for:

  • Data whose distribution is strongly (and linearly) correlated with distance. For example, noise falls off very predictably with distance.
  • Providing explicit control over the influence of distance (compared to Spline or Kriging).

IDW is not so good for:

  • Data whose distribution depends on more complex sets of variables because it can account only for the effects of distance.

Other features:

  • You can create a smoother surface by decreasing the power, increasing the number of sample points used or increasing the search (sample points) radius.
  • You can create a surface that more closely represents the peaks and dips of your sample points by decreasing the number of sample points used, decreasing the search radius or increasing the power.
  • You can increase IDW surface accuracy by adding breaklines to the interpolation process that serve as barriers. Breaklines represent abrupt changes in elevation, such as cliffs.

Spline

Spline interpolation fits a curved surface through the sample points of your dataset. Imagine stretching a rubber sheet across your points and gluing it to each sample point along the way -- what you get is a smooth stretched sheet with bumps and valleys. Unlike IDW, spline can estimate values above and below the min and max values of your sample points. Thus it is good for estimating high and low values not already represented in your data.

For visualizations of Spline interpolation, see Jochen Albrecht's Spline 3D Concepts Lecture.

Regularized & Tension Spline

There are two types of curved surfaces that can be fit when using spline interpolation:

  1. Tension spline: a flatter surface that forces estimated values to stay closer to the known sample points.

  2. Regularized spline: a more elastic surface that is more likely to estimate above and below the known sample points.

Regular vs tension spline.

For more on spline interpolation, see ESRI's How Splines Work background materials.

Spline Take Home Points

Spline is good for:

  • Estimating values outside of the range of sample input data.
  • Creating a smooth continuous surface.

Spline is not so good for:

  • Points that are close together and have large value differences. Slope calculations can yield over and underestimation.
  • Data with large, sudden changes in values that need to be represented (e.g., fault lines, extreme vertical topographic changes, etc). NOTE: some tools like ArcGIS have introduced a spline with barriers function in recent years.

Natural Neighbor Interpolation

Natural neighbor interpolation finds the closest subset of data points to the query point of interest. It then applies weights to those points to calculate an average estimated value based upon their proportionate areas derived from their corresponding Voronoi polygons (see figure below for definition). The natural neighbor interpolator adapts locally to the input data using points surrounding the query point of interest. Thus there is no radius, number of points or other settings needed when using this approach.

This interpolation method works equally well on regularly and irregularly spaced data.

A Voronoi diagram is created by taking pairs of points that are close together and drawing a line that is equidistant between them and perpendicular to the line connecting them. Source: Wikipedia

Natural neighbor interpolation uses the area of each Voronoi polygon associated with the surrounding points to derive a weight that is then used to calculate an estimated value for the query point of interest.

To calculate the weighted area around a query point, a secondary Voronoi diagram is created using the immediately neighboring points and mapped on top of the original Voronoi diagram created using the known sample points (image below).

A secondary Voronoi diagram is created using the immediately neighboring points and mapped on top of the original Voronoi diagram created using the known sample points to create a weighted natural neighbor interpolated raster. Image Source: ESRI

For more on natural neighbor interpolation, see ESRI's How Natural Neighbor Works documentation.

Natural Neighbor Take Home Points

Natural Neighbor Interpolation is good for:

  • Data where spatial distribution is variable (and data that are equally distributed).
  • Categorical data.
  • Providing a smoother output raster.

Natural Neighbor Interpolation is not as good for:

  • Data where the interpolation needs to be spatially constrained (to a particular number of points or a distance).
  • Data where sample points further away from or beyond the immediate "neighbor points" need to be considered in the estimation.

Other features:

  • Good as a local interpolator
  • Interpolated values fall within the range of values of the sample data
  • Surface passes through input samples; not above or below
  • Supports breaklines

Triangulated Irregular Network (TIN)

The Triangulated Irregular Network (TIN) is a vector based surface where sample points (nodes) are connected by a series of edges creating a triangulated surface. The TIN format remains the most true to the point distribution, density and spacing of a dataset. It also may yield the largest file size!

A TIN created from LiDAR data collected by the NEON AOP over the NEON San Joaquin Experimental Range (SJER) field site.
  • For more on the TIN process see this information from ESRI: Overview of TINs

Interpolation in R, GrassGIS, or QGIS

These additional resources point to tools and information for gridding in R, GRASS GIS and QGIS.

R functions

The following packages and functions may be useful when creating grids in R.

  • gstat::idw()
  • stats::loess()
  • akima::interp()
  • fields::Tps()
  • fields::splint()
  • spatial::surf.ls()
  • geospt::rbf()
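As a concrete example of one of these, the sketch below runs a simple IDW interpolation with gstat::idw() on hypothetical sample points (assuming gstat's support for sf objects); idp is the power parameter discussed earlier:

library(gstat)
library(sf)

# hypothetical sample points with an elevation value (z)
samples <- st_as_sf(data.frame(x = runif(100, 0, 10),
                               y = runif(100, 0, 10),
                               z = rnorm(100, mean = 350, sd = 20)),
                    coords = c("x", "y"))

# a regular grid of locations to predict at
grd <- st_as_sf(expand.grid(x = seq(0, 10, by = 0.25),
                            y = seq(0, 10, by = 0.25)),
                coords = c("x", "y"))

# inverse distance weighted interpolation; idp is the IDW power
idw_out <- idw(z ~ 1, samples, newdata = grd, idp = 2)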

QGIS tools

  • Check out the documentation on different interpolation plugins Interpolation
  • Check out the documentation on how to convert a vector file to a raster: Rasterize (vector to raster).

The QGIS processing toolbox provides easy access to GRASS commands.

GrassGIS commands

  • Check out the documentation on GRASS GIS Integration Starting the GRASS plugin

The following commands may be useful if you are working with GrassGIS.

  • v.surf.idw - Surface interpolation from vector point data by Inverse Distance Squared Weighting
  • v.surf.bspline - Bicubic or bilinear spline interpolation with Tykhonov regularization
  • v.surf.rst - Spatial approximation and topographic analysis using regularized spline with tension
  • v.to.rast.value - Converts (rasterize) a vector layer into a raster layer
  • v.delaunay - Creates a Delaunay triangulation from an input vector map containing points or centroids
  • v.voronoi - Creates a Voronoi diagram from an input vector layer containing points

Plot Spectral Signatures Derived from Hyperspectral Remote Sensing Data in HDF5 Format in R

Learning Objectives

After completing this tutorial, you will be able to:

  • Extract and plot a single spectral signature from an HDF5 file.
  • Work with groups and datasets within an HDF5 file.

Things You’ll Need To Complete This Tutorial

To complete this tutorial you will need the most current version of R and, preferably, RStudio loaded on your computer.

R Libraries to Install:

  • rhdf5: install.packages("BiocManager"), BiocManager::install("rhdf5")
  • plyr: install.packages('plyr')
  • ggplot2: install.packages('ggplot2')
  • neonUtilities: install.packages('neonUtilities')

More on Packages in R - Adapted from Software Carpentry.

Data

These hyperspectral remote sensing data provide information on the National Ecological Observatory Network's San Joaquin Experimental Range (SJER) field site in March of 2021. The data used in this lesson is the 1km by 1km mosaic tile named NEON_D17_SJER_DP3_257000_4112000_reflectance.h5. If you already completed the previous lesson in this tutorial series, you do not need to download this data again. The entire SJER reflectance dataset can be accessed from the NEON Data Portal.

Set Working Directory: This lesson assumes that you have set your working directory to the location of the downloaded and unzipped data subsets.

An overview of setting the working directory in R can be found here.

Recommended Skills

For this tutorial, you should be comfortable reading data from a HDF5 file and have a general familiarity with hyperspectral data. If you aren't familiar with these steps already, we highly recommend you work through the Introduction to Working with Hyperspectral Data in HDF5 Format in R tutorial before moving on to this tutorial.

Everything on our planet reflects electromagnetic radiation from the Sun, and different types of land cover often have dramatically different reflectance properties across the spectrum. One of the most powerful aspects of the NEON Imaging Spectrometer (NIS, or hyperspectral sensor) is that it can accurately measure these reflectance properties at a very high spectral resolution. When you plot the reflectance values across the observed spectrum, you will see that different land cover types (vegetation, pavement, bare soils, etc.) have distinct patterns in their reflectance values, a property that we call the 'spectral signature' of a particular land cover class.

In this tutorial, we will extract the reflectance values for all bands of a single pixel to plot a spectral signature for that pixel. In order to do this, we need to pair the reflectance values for that pixel with the wavelength values of the bands that are represented in those measurements. We will also need to adjust the reflectance values by the scaling factor that is saved as an 'attribute' in the HDF5 file. First, let's start by defining the working directory and reading in the example dataset.

# Call required packages

library(rhdf5)

library(plyr)

library(ggplot2)

library(neonUtilities)



wd <- "~/data/" #This will depend on your local environment

setwd(wd)

If you haven't already downloaded the hyperspectral data tile (in one of the previous tutorials in this series), you can use the neonUtilities function byTileAOP to download a single reflectance tile. You can run help(byTileAOP) to see more details on what the various inputs are. For this exercise, we'll specify the UTM Easting and Northing to be (257500, 4112500), which will download the tile with the lower left corner (257000, 4112000).

byTileAOP(dpID = 'DP3.30006.001',

          site = 'SJER',

          year = '2021',

          easting = 257500,

          northing = 4112500,

          savepath = wd)

This file will be downloaded into a nested subdirectory under the ~/data folder (your working directory), inside a folder named DP3.30006.001 (the Data Product ID). The file should show up in this location: ~/data/DP3.30006.001/neon-aop-products/2021/FullSite/D17/2021_SJER_5/L3/Spectrometer/Reflectance/NEON_D17_SJER_DP3_257000_4112000_reflectance.h5.

Now we can read in the file and look at the contents using h5ls. You can move this file to a different location, but make sure to change the path accordingly.

# define the h5 file name (specify the full path)

h5_file <- paste0(wd,"DP3.30006.001/neon-aop-products/2021/FullSite/D17/2021_SJER_5/L3/Spectrometer/Reflectance/NEON_D17_SJER_DP3_257000_4112000_reflectance.h5")



# look at the HDF5 file structure 

h5ls(h5_file) #optionally specify all=TRUE if you want to see all of the information

##                                           group                                 name       otype  dclass               dim
## 0                                             /                                 SJER   H5I_GROUP                          
## 1                                         /SJER                          Reflectance   H5I_GROUP                          
## 2                             /SJER/Reflectance                             Metadata   H5I_GROUP                          
## 3                    /SJER/Reflectance/Metadata                    Ancillary_Imagery   H5I_GROUP                          
## 4  /SJER/Reflectance/Metadata/Ancillary_Imagery                Aerosol_Optical_Depth H5I_DATASET INTEGER       1000 x 1000
## 5  /SJER/Reflectance/Metadata/Ancillary_Imagery                               Aspect H5I_DATASET   FLOAT       1000 x 1000
## 6  /SJER/Reflectance/Metadata/Ancillary_Imagery                          Cast_Shadow H5I_DATASET INTEGER       1000 x 1000
## 7  /SJER/Reflectance/Metadata/Ancillary_Imagery Dark_Dense_Vegetation_Classification H5I_DATASET INTEGER       1000 x 1000
## 8  /SJER/Reflectance/Metadata/Ancillary_Imagery                 Data_Selection_Index H5I_DATASET INTEGER       1000 x 1000
## 9  /SJER/Reflectance/Metadata/Ancillary_Imagery                 Haze_Cloud_Water_Map H5I_DATASET INTEGER       1000 x 1000
## 10 /SJER/Reflectance/Metadata/Ancillary_Imagery                  Illumination_Factor H5I_DATASET INTEGER       1000 x 1000
## 11 /SJER/Reflectance/Metadata/Ancillary_Imagery                          Path_Length H5I_DATASET   FLOAT       1000 x 1000
## 12 /SJER/Reflectance/Metadata/Ancillary_Imagery                      Sky_View_Factor H5I_DATASET INTEGER       1000 x 1000
## 13 /SJER/Reflectance/Metadata/Ancillary_Imagery                                Slope H5I_DATASET   FLOAT       1000 x 1000
## 14 /SJER/Reflectance/Metadata/Ancillary_Imagery             Smooth_Surface_Elevation H5I_DATASET   FLOAT       1000 x 1000
## 15 /SJER/Reflectance/Metadata/Ancillary_Imagery                 Visibility_Index_Map H5I_DATASET INTEGER       1000 x 1000
## 16 /SJER/Reflectance/Metadata/Ancillary_Imagery                   Water_Vapor_Column H5I_DATASET   FLOAT       1000 x 1000
## 17 /SJER/Reflectance/Metadata/Ancillary_Imagery            Weather_Quality_Indicator H5I_DATASET INTEGER   3 x 1000 x 1000
## 18                   /SJER/Reflectance/Metadata                    Coordinate_System   H5I_GROUP                          
## 19 /SJER/Reflectance/Metadata/Coordinate_System             Coordinate_System_String H5I_DATASET  STRING             ( 0 )
## 20 /SJER/Reflectance/Metadata/Coordinate_System                            EPSG Code H5I_DATASET  STRING             ( 0 )
## 21 /SJER/Reflectance/Metadata/Coordinate_System                             Map_Info H5I_DATASET  STRING             ( 0 )
## 22 /SJER/Reflectance/Metadata/Coordinate_System                                Proj4 H5I_DATASET  STRING             ( 0 )
## 23                   /SJER/Reflectance/Metadata                    Flight_Trajectory   H5I_GROUP                          
## 24                   /SJER/Reflectance/Metadata                                 Logs   H5I_GROUP                          
## 25              /SJER/Reflectance/Metadata/Logs                               195724   H5I_GROUP                          
## 26       /SJER/Reflectance/Metadata/Logs/195724                     ATCOR_Input_file H5I_DATASET  STRING             ( 0 )
## 27       /SJER/Reflectance/Metadata/Logs/195724                 ATCOR_Processing_Log H5I_DATASET  STRING             ( 0 )
## 28       /SJER/Reflectance/Metadata/Logs/195724                Shadow_Processing_Log H5I_DATASET  STRING             ( 0 )
## 29       /SJER/Reflectance/Metadata/Logs/195724               Skyview_Processing_Log H5I_DATASET  STRING             ( 0 )
## 30       /SJER/Reflectance/Metadata/Logs/195724                  Solar_Azimuth_Angle H5I_DATASET   FLOAT             ( 0 )
## 31       /SJER/Reflectance/Metadata/Logs/195724                   Solar_Zenith_Angle H5I_DATASET   FLOAT             ( 0 )
## 32              /SJER/Reflectance/Metadata/Logs                               200251   H5I_GROUP                          
## 33       /SJER/Reflectance/Metadata/Logs/200251                     ATCOR_Input_file H5I_DATASET  STRING             ( 0 )
## 34       /SJER/Reflectance/Metadata/Logs/200251                 ATCOR_Processing_Log H5I_DATASET  STRING             ( 0 )
## 35       /SJER/Reflectance/Metadata/Logs/200251                Shadow_Processing_Log H5I_DATASET  STRING             ( 0 )
## 36       /SJER/Reflectance/Metadata/Logs/200251               Skyview_Processing_Log H5I_DATASET  STRING             ( 0 )
## 37       /SJER/Reflectance/Metadata/Logs/200251                  Solar_Azimuth_Angle H5I_DATASET   FLOAT             ( 0 )
## 38       /SJER/Reflectance/Metadata/Logs/200251                   Solar_Zenith_Angle H5I_DATASET   FLOAT             ( 0 )
## 39              /SJER/Reflectance/Metadata/Logs                               200812   H5I_GROUP                          
## 40       /SJER/Reflectance/Metadata/Logs/200812                     ATCOR_Input_file H5I_DATASET  STRING             ( 0 )
## 41       /SJER/Reflectance/Metadata/Logs/200812                 ATCOR_Processing_Log H5I_DATASET  STRING             ( 0 )
## 42       /SJER/Reflectance/Metadata/Logs/200812                Shadow_Processing_Log H5I_DATASET  STRING             ( 0 )
## 43       /SJER/Reflectance/Metadata/Logs/200812               Skyview_Processing_Log H5I_DATASET  STRING             ( 0 )
## 44       /SJER/Reflectance/Metadata/Logs/200812                  Solar_Azimuth_Angle H5I_DATASET   FLOAT             ( 0 )
## 45       /SJER/Reflectance/Metadata/Logs/200812                   Solar_Zenith_Angle H5I_DATASET   FLOAT             ( 0 )
## 46              /SJER/Reflectance/Metadata/Logs                               201441   H5I_GROUP                          
## 47       /SJER/Reflectance/Metadata/Logs/201441                     ATCOR_Input_file H5I_DATASET  STRING             ( 0 )
## 48       /SJER/Reflectance/Metadata/Logs/201441                 ATCOR_Processing_Log H5I_DATASET  STRING             ( 0 )
## 49       /SJER/Reflectance/Metadata/Logs/201441                Shadow_Processing_Log H5I_DATASET  STRING             ( 0 )
## 50       /SJER/Reflectance/Metadata/Logs/201441               Skyview_Processing_Log H5I_DATASET  STRING             ( 0 )
## 51       /SJER/Reflectance/Metadata/Logs/201441                  Solar_Azimuth_Angle H5I_DATASET   FLOAT             ( 0 )
## 52       /SJER/Reflectance/Metadata/Logs/201441                   Solar_Zenith_Angle H5I_DATASET   FLOAT             ( 0 )
## 53                   /SJER/Reflectance/Metadata                        Spectral_Data   H5I_GROUP                          
## 54     /SJER/Reflectance/Metadata/Spectral_Data                                 FWHM H5I_DATASET   FLOAT               426
## 55     /SJER/Reflectance/Metadata/Spectral_Data                           Wavelength H5I_DATASET   FLOAT               426
## 56                   /SJER/Reflectance/Metadata              to-sensor_azimuth_angle H5I_DATASET   FLOAT       1000 x 1000
## 57                   /SJER/Reflectance/Metadata               to-sensor_zenith_angle H5I_DATASET   FLOAT       1000 x 1000
## 58                            /SJER/Reflectance                     Reflectance_Data H5I_DATASET INTEGER 426 x 1000 x 1000

Read Wavelength Values

Next, let's read in the wavelength center associated with each band in the HDF5 file. We will later match these with the reflectance values and show both in our final spectral signature plot.

# read in the wavelength information from the HDF5 file

wavelengths <- h5read(h5_file,"/SJER/Reflectance/Metadata/Spectral_Data/Wavelength")

Extract Z-dimension data slice

Next, we will extract all reflectance values for one pixel. This makes up the spectral signature or profile of the pixel. To do that, we'll use the h5read() function. Here we pick an arbitrary pixel at (100,35), and use the NULL value to select all bands from that location.

# extract all bands from a single pixel

aPixel <- h5read(h5_file,"/SJER/Reflectance/Reflectance_Data",index=list(NULL,100,35))



# The line above generates a vector of reflectance values.

# Next, we reshape the data and turn them into a dataframe

b <- adply(aPixel,c(1))



# create clean data frame

aPixeldf <- b[2]



# add wavelength data to the data frame

aPixeldf$Wavelength <- wavelengths



head(aPixeldf)

##    V1 Wavelength
## 1 206   381.6035
## 2 266   386.6132
## 3 274   391.6229
## 4 297   396.6327
## 5 236   401.6424
## 6 236   406.6522

Scale Factor

Next, we can pull the attributes that we'll need to adjust the reflectance values. Reflectance is naturally a floating-point (decimal) quantity, but floating-point data take up more space (yield a larger file size) than integer values. To keep file sizes smaller, the reflectance values are stored as integers that have been scaled by a factor of 10, 100, 10000, etc. This scale factor is recorded in the data attributes, so we can divide by it to recover the true reflectance.

# grab scale factor from the Reflectance attributes

reflectanceAttr <- h5readAttributes(h5_file,"/SJER/Reflectance/Reflectance_Data" )



scaleFact <- reflectanceAttr$Scale_Factor



# add scaled data column to DF

aPixeldf$scaled <- (aPixeldf$V1/as.vector(scaleFact))



# make nice column names

names(aPixeldf) <- c('Reflectance','Wavelength','ScaledReflectance')

head(aPixeldf)

##   Reflectance Wavelength ScaledReflectance
## 1         206   381.6035            0.0206
## 2         266   386.6132            0.0266
## 3         274   391.6229            0.0274
## 4         297   396.6327            0.0297
## 5         236   401.6424            0.0236
## 6         236   406.6522            0.0236

Plot Spectral Signature

Now we're ready to plot our spectral signature!

ggplot(data=aPixeldf)+
   geom_line(aes(x=Wavelength, y=ScaledReflectance))+
   xlab("Wavelength (nm)")+
   ylab("Reflectance")

Spectral signature plot with wavelength in nanometers on the x-axis and reflectance on the y-axis.

Filter, Piping, and GREPL Using R DPLYR - An Intro

Learning Objectives

After completing this tutorial, you will be able to:

  • Filter data, alone and combined with simple pattern matching using grepl().
  • Use the group_by function in dplyr.
  • Use the summarise function in dplyr.
  • "Pipe" functions using dplyr syntax.

Things You’ll Need To Complete This Tutorial

You will need the most current version of R and, preferably, RStudio loaded on your computer to complete this tutorial.

Install R Packages

  • neonUtilities: install.packages("neonUtilities") tools for working with NEON data
  • dplyr: install.packages("dplyr") used for data manipulation

Intro to dplyr

When working with data frames in R, it is often useful to manipulate and summarize data. The dplyr package offers one of the most comprehensive sets of functions for performing common data manipulation tasks. In addition, the dplyr functions often have simpler syntax than most other data manipulation functions in R.

Elements of dplyr

There are several elements of dplyr that are unique to the library, and that do very cool things!

Functions for manipulating data

The text below was excerpted from the R CRAN dplyr vignettes.

dplyr aims to provide a function for each basic verb of data manipulation, like the following (a short sketch applying a few of these verbs to a built-in data frame appears after the list):

  • filter() (and slice())
    • filter rows based on values in specified columns
  • arrange()
    • sort data by values in specified columns
  • select() (and rename())
    • view and work with data from only specified columns
  • distinct()
    • view and work with only unique values from specified columns
  • mutate() (and transmute())
    • add new data to the data frame
  • summarise()
    • calculate specified summary statistics on data
  • sample_n() and sample_frac()
    • return a random sample of rows
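
To make these verbs concrete, here is a minimal sketch (not part of the NEON data used later in this tutorial) that applies a few of them to R's built-in iris data frame; it assumes the dplyr package is installed and loaded.

library(dplyr)

# keep only rows where Species is "setosa"
filter(iris, Species == "setosa")

# sort rows by Sepal.Length, largest first
arrange(iris, desc(Sepal.Length))

# keep only two columns
select(iris, Species, Sepal.Length)

# unique values of Species
distinct(iris, Species)

# add a new, calculated column
mutate(iris, Sepal.Ratio = Sepal.Length / Sepal.Width)

# a one-row summary of the whole data frame
summarise(iris, mean_sepal_length = mean(Sepal.Length))

# three randomly sampled rows
sample_n(iris, 3)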

Format of function calls

The single table verb functions share these features:

  • The first argument is a data.frame (or a dplyr special class tbl_df, known as a 'tibble').
    • dplyr can work with data.frames as is, but if you're dealing with large data it's worthwhile to convert them to a tibble, to avoid printing a lot of data to the screen. You can do this with the function as_tibble() (see the short sketch after this list).
    • Calling the class function on a tibble will return the vector c("tbl_df", "tbl", "data.frame").
  • The subsequent arguments describe how to manipulate the data (e.g., based on which columns, using which summary statistics), and you can refer to columns directly (without using $).
  • The result is always a new tibble.
  • Function calls do not generate 'side-effects'; you always have to assign the results to an object.
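
As a quick illustration of the tibble points above, here is a minimal sketch using the built-in iris data frame rather than NEON data (as_tibble() comes from the tibble package and is re-exported by recent versions of dplyr; if it isn't found, load tibble as well).

library(dplyr)

# convert a regular data.frame to a tibble
iris_tbl <- as_tibble(iris)

class(iris)     # "data.frame"
class(iris_tbl) # "tbl_df" "tbl" "data.frame"

# printing a tibble shows only the first 10 rows, handy for large data
iris_tbl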

Grouped operations

Certain functions (e.g., group_by, summarise, and other 'aggregate functions') allow you to get information for groups of data, in one fell swoop. This is like performing database operations without knowing SQL or any other database-specific code. Powerful stuff!

Piping

We often need to get a subset of data using one function, and then use another function to do something with that subset (and we may do this multiple times). This leads to nesting functions, which can get messy and hard to keep track of. Enter 'piping', dplyr's way of feeding the output of one function into the next without the hassle of nested parentheses and brackets.

Let's say we want to start with the data frame my_data, apply function1(), then function2(), and then function3(). This is what that looks like without piping:

function3(function2(function1(my_data)))

This is messy, difficult to read, and the reverse of the order our functions actually occur. If any of these functions needed additional arguments, the readability would be even worse!

The piping operator %>% takes everything in front of it and "pipes" it into the first argument of the function after. So now our example looks like this:

my_data %>%
  function1() %>%
  function2() %>%
  function3()

This runs identically to the original nested version!
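
To see that "first argument" behavior with a real function and an extra argument, here is a small sketch using the built-in iris data frame; the %>% operator becomes available once dplyr is loaded.

library(dplyr)

# these two calls do exactly the same thing: the pipe places iris into
# head()'s first argument, and additional arguments follow as usual
head(iris, n = 3)

iris %>% head(n = 3)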

For example, if we want to find the mean body weight of male mice, we'd do this:

	myMammalData %>%                     # start with a data frame
		filter(sex=='M') %>%               # first filter for rows where sex is male
		summarise (mean_weight = mean(weight))  # find the mean of the weight 
                                            # column, store as mean_weight

This is read as "from data frame myMammalData, select only males and return the mean weight as a new list mean_weight".

Download Small Mammal Data

Before we get started, we will need to download our data to investigate. To do so, we will use the loadByProduct() function from the neonUtilities package to download data straight from the NEON servers. To learn more about this function, please see the Download and Explore NEON data tutorial here.

Let's look at the NEON small mammal capture data from Harvard Forest (within Domain 01) for all of 2014. This site is located in the heart of the Lyme disease epidemic.

Read more about NEON terrestrial measurements here.

# load packages
library(dplyr)
library(neonUtilities)

# load the NEON small mammal capture data
# NOTE: the check.size = TRUE argument means the function 
# will require confirmation from you that you want to load 
# the quantity of data requested
loadData <- loadByProduct(dpID="DP1.10072.001", site = "HARV", 
                 startdate = "2014-01", enddate = "2014-12", 
                 check.size = TRUE) # Console requires your response!

# if you'd like, check out the data
str(loadData)

The loadByProduct() function calls the NEON server, downloads the monthly data files, and 'stacks' them together to form two tables of data called 'mam_pertrapnight' and 'mam_perplotnight'. It also saves the readme file, explanations of variables, and validation metadata, and combines these all into a single 'list' that we called 'loadData'. The only part of this list that we really need for this tutorial is the 'mam_pertrapnight' table, so let's extract just that one and call it 'myData'.

myData <- loadData$mam_pertrapnight

class(myData) # Confirm that 'myData' is a data.frame

## [1] "data.frame"

Use dplyr

For the rest of this tutorial, we are only going to be working with three variables within 'myData':

  • scientificName a string of "Genus species"
  • sex a string with "F", "M", or "U"
  • identificationQualifier a string noting uncertainty in the species identification

filter()

This function:

  • extracts only a subset of rows from a data frame according to specified conditions
  • is similar to the base function subset(), but with simpler syntax
  • inputs: data object, any number of conditional statements on the named columns of the data object
  • output: a data object of the same class as the input object (e.g., data.frame in, data.frame out) with only those rows that meet the conditions

For example, let's create a new dataframe that contains only female Peromyscus maniculatus, one of the key small mammal players in the life cycle of the Lyme disease-causing bacterium.

# filter on `scientificName` is Peromyscus maniculatus and `sex` is female. 
# two equals signs (==) signifies "is"
data_PeroManicFemales <- filter(myData, 
                   scientificName == 'Peromyscus maniculatus', 
                   sex == 'F')

# Note how we were able to put multiple conditions into the filter statement,
# pretty cool!

So we have a dataframe with our female P. maniculatus, but how many are there?

# how many female P. maniculatus are in the dataset
# we could simply count the number of rows in the new dataset
nrow(data_PeroManicFemales)

## [1] 98

# or we could write it as a sentence
print(paste('In 2014, NEON technicians captured',
                   nrow(data_PeroManicFemales),
                   'female Peromyscus maniculatus at Harvard Forest.',
                   sep = ' '))

## [1] "In 2014, NEON technicians captured 98 female Peromyscus maniculatus at Harvard Forest."

That's awesome that we can quickly and easily count the number of female Peromyscus maniculatus that were captured at Harvard Forest in 2014. However, there is a slight problem. There are two Peromyscus species that are common in Harvard Forest: Peromyscus maniculatus (deer mouse) and Peromyscus leucopus (white-footed mouse). These species are difficult to distinguish in the field, leading to some uncertainty in the identification, which is noted in the 'identificationQualifier' data field by the term "cf. species" or "aff. species". (For more information on these terms and 'open nomenclature' please see this great Wiki article here). When the field technician is certain of their identification (or if they forget to note their uncertainty), identificationQualifier will be NA. Let's run the same code as above, but this time specifying that we want only those observations that are unambiguous identifications.

# filter on `scientificName` is Peromyscus maniculatus and `sex` is female. 
# two equals signs (==) signifies "is"
data_PeroManicFemalesCertain <- filter(myData, 
                   scientificName == 'Peromyscus maniculatus', 
                   sex == 'F',
                   identificationQualifier =="NA")

# Count the number of un-ambiguous identifications
nrow(data_PeroManicFemalesCertain)

## [1] 0

Woah! So every single observation of a Peromyscus maniculatus had some level of uncertainty that the individual may actually be Peromyscus leucopus. This is understandable given the difficulty of field identification for these species. Given this challenge, it will be best for us to consider these mice at the genus level. We can accomplish this by searching for only the genus name in the 'scientificName' field using the grepl() function.

grepl()

This is a function in the base package (i.e., it isn't part of dplyr) that belongs to R's suite of regular expression functions. grepl uses regular expressions to match patterns in character strings. Regular expressions offer very powerful and useful tricks for data manipulation. They can be complicated and therefore are a challenge to learn, but well worth it!

Here, we present a very simple example.

  • inputs: pattern to match, character vector to search for a match
  • output: a logical vector indicating whether the pattern was found within each element of the input character vector (see the brief sketch below)
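
As a tiny standalone sketch (using made-up strings rather than the NEON data), note that grepl() returns one TRUE/FALSE per element of the character vector it searches:

# a small character vector to search
speciesNames <- c("Peromyscus maniculatus", "Peromyscus leucopus", "Blarina brevicauda")

# TRUE wherever the pattern "Peromyscus" appears in the string
grepl("Peromyscus", speciesNames)

## [1]  TRUE  TRUE FALSE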

Let's use grepl to learn more about our possible disease vectors. In reality, all species of Peromyscus are viable players in Lyme disease transmission, and they are difficult to distinguish in a field setting, so we really should be looking at all species of Peromyscus. Since we don't have genera split out as a separate field, we have to search within the scientificName string for the genus -- this is a simple example of pattern matching.

We can use the dplyr function filter() in combination with the base function grepl() to accomplish this.

# combine filter & grepl to get all Peromyscus (a part of the 
# scientificName string)
data_PeroFemales <- filter(myData,
                   grepl('Peromyscus', scientificName),
                   sex == 'F')

# how many female Peromyscus are in the dataset
print(paste('In 2014, NEON technicians captured',
                   nrow(data_PeroFemales),
                   'female Peromyscus spp. at Harvard Forest.',
                   sep = ' '))

## [1] "In 2014, NEON technicians captured 558 female Peromyscus spp. at Harvard Forest."

group_by() + summarise()

An alternative to using the filter function to subset the data (and make a new data object in the process) is to calculate summary statistics based on some grouping factor. We'll use group_by(), which does basically the same thing as the GROUP BY clause in SQL or other tools for interacting with relational databases. For those unfamiliar with SQL, no worries - dplyr provides lots of additional functionality for working with databases (local and remote) that does not require knowledge of SQL. How handy!

The group_by() function in dplyr allows you to perform functions on a subset of a dataset without having to create multiple new objects or construct for() loops. The combination of group_by() and summarise() are great for generating simple summaries (counts, sums) of grouped data.

NOTE: Be conscientious about using summarise with other summary functions! You need to think about weighting for means and variances, and summarise doesn't work precisely for medians if there is any missing data (e.g., if there was no value recorded, maybe it was for a good reason!).

Continuing with our small mammal data, since the diversity of the entire small mammal community has been shown to impact disease dynamics among the key reservoir species, we really want to know more about the demographics of the whole community. We can quickly generate counts by species and sex in 2 lines of code, using group_by and summarise.

# how many of each species & sex were there?
# step 1: group by species & sex
dataBySpSex <- group_by(myData, scientificName, sex)

# step 2: summarize the number of individuals of each using the new df
countsBySpSex <- summarise(dataBySpSex, n_individuals = n())

## `summarise()` regrouping output by 'scientificName' (override with `.groups` argument)

# view the data (just top 10 rows)
head(countsBySpSex, 10)

## # A tibble: 10 x 3
## # Groups:   scientificName [5]
##    scientificName          sex   n_individuals
##    <chr>                   <chr>         <int>
##  1 Blarina brevicauda      F                50
##  2 Blarina brevicauda      M                 8
##  3 Blarina brevicauda      U                22
##  4 Blarina brevicauda      <NA>             19
##  5 Glaucomys volans        M                 1
##  6 Mammalia sp.            U                 1
##  7 Mammalia sp.            <NA>              1
##  8 Microtus pennsylvanicus F                 2
##  9 Myodes gapperi          F               103
## 10 Myodes gapperi          M                99

Note: the output of step 1 (dataBySpSex) does not look any different than the original dataframe (myData), but the application of subsequent functions (e.g., summarise) to this new dataframe will produce distinctly different results than if you tried the same operations on the original. Try it if you want to see the difference! This is because the group_by() function converted dataBySpSex into a "grouped_df" rather than just a "data.frame". In order to confirm this, try using the class() function on both myData and dataBySpSex. You can also read the help documentation for this function by running the code: ?group_by()

# View class of 'myData' object
class(myData)

## [1] "data.frame"

# View class of 'dataBySpSex' object
class(dataBySpSex)

## [1] "grouped_df" "tbl_df"     "tbl"        "data.frame"

# View help file for group_by() function
?group_by()

Pipe functions together

We created multiple new data objects during our explorations of dplyr functions, above. While this works, we can produce the same results more efficiently by chaining functions together and creating only one new data object that encapsulates all of the previously sought information: filter on only females, grepl to get only Peromyscus spp., group_by individual species, and summarise the numbers of individuals.

# combine several functions to get a summary of the numbers of individuals of 
# female Peromyscus species in our dataset.

# remember %>% are "pipes" that allow us to pass information from one function 
# to the next. 

dataBySpFem <- myData %>% 
                  filter(grepl('Peromyscus', scientificName), sex == "F") %>%
                  group_by(scientificName) %>%
                  summarise(n_individuals = n())

## `summarise()` ungrouping output (override with `.groups` argument)

# view the data
dataBySpFem

## # A tibble: 3 x 2
##   scientificName         n_individuals
##   <chr>                          <int>
## 1 Peromyscus leucopus              455
## 2 Peromyscus maniculatus            98
## 3 Peromyscus sp.                     5

Cool!

Base R only

So that is nice, but we had to install a new package, dplyr. You might ask, "Is it really worth it to learn new commands if I can do this in base R?" While we think "yes", why don't you see for yourself? Here is the base R code needed to accomplish the same task.

# For reference, the same output but using R's base functions

# First, subset the data to only female Peromyscus
dataFemPero  <- myData[myData$sex == 'F' & 
                   grepl('Peromyscus', myData$scientificName), ]

# Option 1 --------------------------------
# Use aggregate and then rename columns
dataBySpFem_agg <-aggregate(dataFemPero$sex ~ dataFemPero$scientificName, 
                   data = dataFemPero, FUN = length)
names(dataBySpFem_agg) <- c('scientificName', 'n_individuals')

# view output
dataBySpFem_agg

##           scientificName n_individuals
## 1    Peromyscus leucopus           455
## 2 Peromyscus maniculatus            98
## 3         Peromyscus sp.             5

# Option 2 --------------------------------------------------------
# Do it by hand

# Get the unique scientificNames in the subset
sppInDF <- unique(dataFemPero$scientificName[!is.na(dataFemPero$scientificName)])

# Use a loop to calculate the numbers of individuals of each species
sciName <- vector(); numInd <- vector()
for (i in sppInDF) {
  sciName <- c(sciName, i)
  numInd <- c(numInd, length(which(dataFemPero$scientificName==i)))
}

#Create the desired output data frame
dataBySpFem_byHand <- data.frame('scientificName'=sciName, 
                   'n_individuals'=numInd)

# view output
dataBySpFem_byHand

##           scientificName n_individuals
## 1    Peromyscus leucopus           455
## 2 Peromyscus maniculatus            98
## 3         Peromyscus sp.             5
