Tutorial
Working With Time Series Data Within a Nested HDF5 File in R
Authors: Leah A Wasser, Ted Hart
Last Updated: Nov 23, 2020
In this tutorial, we'll work with temperature data collected using sensors on a flux tower (https://www.neonscience.org/data-collection/flux-tower-measurements) by the National Ecological Observatory Network (NEON). Here the data are provided in HDF5 format to allow for the exploration of this format. More information about NEON temperature data can be found on the NEON Data Portal. Please note that at the present time temperature data are published on the data portal as a flat .csv file and not as an HDF5 file. NEON data products currently released in HDF5 include eddy covariance data and remote sensing data.
We'll examine our HDF5 file as if we knew nothing about it. We will
view its structure, extract metadata and visualize data contained
within datasets in the HDF5 file. We will also use loops and custom
functions to efficiently examine data with a complex nested structure
using advanced tools like dplyr.
Learning Objectives
After completing this tutorial, you will be able to:
- Explore an HDF5 file and access groups and datasets in R.
- Explain the basics of manipulating big datasets using indexing, loops, and dplyr.
- Refine your data plotting skills using ggplot in R.
- Apply a variety of data manipulation tasks including identifying data types given a new dataset, string parsing, and working with & formatting date information.
Things You’ll Need To Complete This Tutorial
To complete this tutorial you will need the most current version of R and, preferably, RStudio loaded on your computer.
R Libraries to Install:
- rhdf5
- ggplot2 for plotting: install.packages("ggplot2")
- dplyr for data manipulation: install.packages("dplyr")
- scales for plotting dates: install.packages("scales")
More on Packages in R – Adapted from Software Carpentry.
Data to Download
We will use the file below in the optional challenge activity at the end of this tutorial.
NEON Teaching Data Subset: Sample Tower Temperature - HDF5
These temperature data were collected by the National Ecological Observatory Network's flux towers at field sites across the US. The entire dataset can be accessed by request from the NEON Data Portal.
Set Working Directory: This lesson assumes that you have set your working directory to the location of the downloaded and unzipped data subsets.
An overview of setting the working directory in R can be found here.
R Script & Challenge Code: NEON data lessons often contain challenges that reinforce learned skills. If available, the code for challenge solutions is found in the downloadable R script of the entire lesson, available in the footer of each lesson page.
Additional Resources
Consider reviewing the documentation for the rhdf5 package.
rhdf5 package & R
To access HDF5 files in R, we'll use rhdf5, which is part of the
Bioconductor suite of R packages.
It might also be useful to install HDFView which will allow you to explore the contents of an HDF5 file visually using a graphic interface.
# Install rhdf5 library
#install.packages("BiocManager")
#BiocManager::install("rhdf5")
library("rhdf5")
# also load ggplot2 and dplyr
library("ggplot2")
library("dplyr")
# scales is a handy R package that helps with date formatting on plot axes
library("scales")
# set working directory to ensure R can find the file we wish to import and where
# we want to save our files
#setwd("working-dir-path-here")
HDF5 Quick Review
The HDF5 format is a self-contained directory structure. In HDF5 files, though, "directories" are called "groups" and "files" are called "datasets". Each element in an HDF5 file can have metadata attached to it, making HDF5 files "self-describing".
Read more about HDF5 in this Data Skills tutorial.
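To make the directory analogy concrete, here is a minimal sketch that writes a tiny HDF5 file and lists it. It assumes the rhdf5 package is installed; the group names, values and the attribute text are made up for illustration and are not part of the NEON file.

```r
# Build a small throwaway HDF5 file (all names and values are hypothetical)
library(rhdf5)

tmp <- tempfile(fileext = ".h5")
h5createFile(tmp)

# groups behave like directories and can be nested
h5createGroup(tmp, "siteA")
h5createGroup(tmp, "siteA/min_1")

# a dataset behaves like a file stored inside a group
h5write(c(15.1, 15.0, 15.3), tmp, "siteA/min_1/temperature")

# attach metadata (an attribute) to the siteA group
fid <- H5Fopen(tmp)
gid <- H5Gopen(fid, "siteA")
h5writeAttribute("Hypothetical demo site", gid, "Site Name")
H5Gclose(gid)
H5Fclose(fid)

# list the contents and read the metadata back
demo_struct <- h5ls(tmp)
demo_attrs <- h5readAttributes(tmp, "siteA")
print(demo_struct)
print(demo_attrs)
```

The listing should contain two groups and one dataset, and the attribute travels with the group it was attached to, which is what "self-describing" refers to.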
Explore the HDF5 File Structure
Let's first explore an HDF5 file that we know nothing about using the R function h5ls.
# Identify file path (be sure to adjust the path to match your file structure!)
f <- "NEONDSTowerTemperatureData.hdf5"
# View structure of file
h5ls(f)
## group name otype dclass dim
## 0 / Domain_03 H5I_GROUP
## 1 /Domain_03 OSBS H5I_GROUP
## 2 /Domain_03/OSBS min_1 H5I_GROUP
## 3 /Domain_03/OSBS/min_1 boom_1 H5I_GROUP
## 4 /Domain_03/OSBS/min_1/boom_1 temperature H5I_DATASET COMPOUND 4323
## 5 /Domain_03/OSBS/min_1 boom_2 H5I_GROUP
## 6 /Domain_03/OSBS/min_1/boom_2 temperature H5I_DATASET COMPOUND 4323
## 7 /Domain_03/OSBS/min_1 boom_3 H5I_GROUP
## 8 /Domain_03/OSBS/min_1/boom_3 temperature H5I_DATASET COMPOUND 4323
## 9 /Domain_03/OSBS/min_1 boom_5 H5I_GROUP
## 10 /Domain_03/OSBS/min_1/boom_5 temperature H5I_DATASET COMPOUND 4323
## 11 /Domain_03/OSBS/min_1 tower_top H5I_GROUP
## 12 /Domain_03/OSBS/min_1/tower_top temperature H5I_DATASET COMPOUND 4323
## 13 /Domain_03/OSBS min_30 H5I_GROUP
## 14 /Domain_03/OSBS/min_30 boom_1 H5I_GROUP
## 15 /Domain_03/OSBS/min_30/boom_1 temperature H5I_DATASET COMPOUND 147
## 16 /Domain_03/OSBS/min_30 boom_2 H5I_GROUP
## 17 /Domain_03/OSBS/min_30/boom_2 temperature H5I_DATASET COMPOUND 147
## 18 /Domain_03/OSBS/min_30 boom_3 H5I_GROUP
## 19 /Domain_03/OSBS/min_30/boom_3 temperature H5I_DATASET COMPOUND 147
## 20 /Domain_03/OSBS/min_30 boom_5 H5I_GROUP
## 21 /Domain_03/OSBS/min_30/boom_5 temperature H5I_DATASET COMPOUND 147
## 22 /Domain_03/OSBS/min_30 tower_top H5I_GROUP
## 23 /Domain_03/OSBS/min_30/tower_top temperature H5I_DATASET COMPOUND 147
## 24 / Domain_10 H5I_GROUP
## 25 /Domain_10 STER H5I_GROUP
## 26 /Domain_10/STER min_1 H5I_GROUP
## 27 /Domain_10/STER/min_1 boom_1 H5I_GROUP
## 28 /Domain_10/STER/min_1/boom_1 temperature H5I_DATASET COMPOUND 4323
## 29 /Domain_10/STER/min_1 boom_2 H5I_GROUP
## 30 /Domain_10/STER/min_1/boom_2 temperature H5I_DATASET COMPOUND 4323
## 31 /Domain_10/STER/min_1 boom_3 H5I_GROUP
## 32 /Domain_10/STER/min_1/boom_3 temperature H5I_DATASET COMPOUND 4323
## 33 /Domain_10/STER min_30 H5I_GROUP
## 34 /Domain_10/STER/min_30 boom_1 H5I_GROUP
## 35 /Domain_10/STER/min_30/boom_1 temperature H5I_DATASET COMPOUND 147
## 36 /Domain_10/STER/min_30 boom_2 H5I_GROUP
## 37 /Domain_10/STER/min_30/boom_2 temperature H5I_DATASET COMPOUND 147
## 38 /Domain_10/STER/min_30 boom_3 H5I_GROUP
## 39 /Domain_10/STER/min_30/boom_3 temperature H5I_DATASET COMPOUND 147
Note that h5ls returns the structure of the HDF5 file, including the
group and dataset names and the type and size of each object. In our file,
the datasets have the class COMPOUND. A compound class means that a dataset
contains a combination of data types (e.g. numbers, strings, etc.).
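A data frame is a close R analogue of a compound dataset, since each column can hold a different type. The sketch below uses a small hand-made data frame (the values are illustrative, not read from the file) to show how to check the type of each column:

```r
# A hand-made stand-in for a compound dataset: one column per field,
# each with its own data type (values are illustrative only)
toy_compound <- data.frame(
  date   = c("2014-04-01 00:00:00.0", "2014-04-01 00:01:00.0"),
  numPts = c(60L, 60L),
  mean   = c(15.06, 15.00),
  stringsAsFactors = FALSE
)

# report the class of every column
col_types <- sapply(toy_compound, class)
print(col_types)
```

Running the same sapply() call on a dataset you have read from an HDF5 file is a quick way to identify the data types you are working with.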
Also note that you can add the recursive argument to the h5ls command to set
the number of nested levels that the command returns. Give it a try.
#specify how many "levels" of nesting are returned by the command
h5ls(f,recursive=2)
## group name otype dclass dim
## 0 / Domain_03 H5I_GROUP
## 1 /Domain_03 OSBS H5I_GROUP
## 2 / Domain_10 H5I_GROUP
## 3 /Domain_10 STER H5I_GROUP
h5ls(f,recursive=3)
## group name otype dclass dim
## 0 / Domain_03 H5I_GROUP
## 1 /Domain_03 OSBS H5I_GROUP
## 2 /Domain_03/OSBS min_1 H5I_GROUP
## 3 /Domain_03/OSBS min_30 H5I_GROUP
## 4 / Domain_10 H5I_GROUP
## 5 /Domain_10 STER H5I_GROUP
## 6 /Domain_10/STER min_1 H5I_GROUP
## 7 /Domain_10/STER min_30 H5I_GROUP
The Data Structure
Looking at the h5ls
output, we see this H5 file has a nested group and dataset
structure. Below, we will slice out temperature data which is located within the
following path:
Domain_03 --> OSBS --> min_1 --> boom_1 --> temperature
Take note that this path is 4 groups "deep" and leads to one dataset called temperature in this part of the HDF5 file as follows:
- Domain_03 - A NEON domain is an ecologically unique region. Domain 3 is one of 20 regions that NEON uses to organize its network spatially.
- OSBS - A group representing data from the Ordway Swisher Biological Station (OSBS).
- min_1 - A group representing the mean temperature data value for every one minute in time. Temperature data are often collected at high frequencies (20 Hz, or 20 measurements a second, or more). A typical data product derived from high frequency temperature data is an average value. In this case, all measurements are averaged every minute.
- boom_1 - Boom 1 is the first and lowest arm or level on the tower. Towers often contain arms where the sensors are mounted, that reach out horizontally away from the tower (see figure below). The tower at Ordway Swisher has a total of 6 booms (booms 1-5 and the tower top).
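The path to a dataset is just these names joined with "/". As a small sketch in base R (the variable names are ours), the path can be assembled from its parts like this:

```r
# the four groups followed by the dataset name, in nesting order
parts <- c("Domain_03", "OSBS", "min_1", "boom_1", "temperature")

# join with "/" and add the leading "/" that marks the root of the file
dataset_path <- paste0("/", paste(parts, collapse = "/"))
print(dataset_path)

# everything except the last element is a group
n_groups <- length(parts) - 1
print(n_groups)
```

Building paths this way makes it easy to swap one component, for example replacing "boom_1" with "boom_2", without retyping the whole string.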

# read in temperature data
temp <- h5read(f,"/Domain_03/OSBS/min_1/boom_1/temperature")
# view the first few lines of the data
head(temp)
## date numPts mean min max variance stdErr
## 1 2014-04-01 00:00:00.0 60 15.06154 14.96886 15.15625 0.002655015 0.006652087
## 2 2014-04-01 00:01:00.0 60 14.99858 14.93720 15.04274 0.001254117 0.004571866
## 3 2014-04-01 00:02:00.0 60 15.26231 15.03502 15.56683 0.041437537 0.026279757
## 4 2014-04-01 00:03:00.0 60 15.45351 15.38553 15.53449 0.001174759 0.004424851
## 5 2014-04-01 00:04:00.0 60 15.35306 15.23799 15.42346 0.003526443 0.007666423
## 6 2014-04-01 00:05:00.0 60 15.12807 15.05846 15.23494 0.003764170 0.007920616
## uncertainty
## 1 0.01620325
## 2 0.01306111
## 3 0.05349682
## 4 0.01286833
## 5 0.01788372
## 6 0.01831239
# generate a quick plot of the data, type=l for "line"
plot(temp$mean,type='l')
Let's clean up the plot above by adding date values to the x axis. In order to list dates on the x axis, we need to tell R the format that the date field is in, so that it knows how to read the values and organize the labels on the axis.
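As a minimal sketch of what this conversion does, the example below parses two timestamps copied from the head(temp) output above. The format string describes the text layout; the trailing ".0" is not covered by the format and is ignored when parsing.

```r
# two timestamps as they appear in the file: plain text, not dates
raw_dates <- c("2014-04-01 00:00:00.0", "2014-04-01 00:01:00.0")
print(class(raw_dates))

# tell R how the text is laid out and which time zone it is recorded in
parsed <- as.POSIXct(raw_dates, format = "%Y-%m-%d %H:%M:%S", tz = "GMT")
print(class(parsed))

# once parsed, R can do date arithmetic, e.g. the gap between records
gap_secs <- as.numeric(difftime(parsed[2], parsed[1], units = "secs"))
print(gap_secs)
```

If the format string does not match the text, as.POSIXct returns NA, so checking for NA values after parsing is a useful habit.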
# First read in the time as UTC format
temp$date <- as.POSIXct(temp$date ,format = "%Y-%m-%d %H:%M:%S", tz = "GMT")
# Create a plot using ggplot2 package
OSBS_Plot <- ggplot(temp,aes(x=date,y=mean))+
geom_path()+
ylab("Mean temperature") +
xlab("Date in UTC")+
ggtitle("3 Days of Temperature Data at Ordway Swisher")
OSBS_Plot
Let's have a close look at this plot. Notice anything unusual with it?
Hint: When would you expect the temperature to be the warmest during the day?

In this case, our data are in the UTC time zone. UTC is a standardized time zone that does not observe daylight saving time. If your data are in UTC, then you will need to convert them to the local time zone where the data are collected for the dates and times to make sense when you plot & analyze your data.
For example, find your local time zone on this wikipedia page. How many hours difference is UTC from your local time?
To adjust for time, we need to tell R that the time zone where the data are
collected is the US Eastern time zone, since our data were collected at OSBS. We
can use the attributes function to set the time zone.
# convert to local time for pretty plotting
attributes(temp$date)$tzone <- "US/Eastern"
# now, plot the data!
OSBS_Plot2 <- ggplot(temp,aes(x=date,y=mean))+
geom_path()+
ylab("Mean temperature") + xlab("Date in Eastern Standard Time")+
theme_bw()+
ggtitle("3 Days of Temperature Data at Ordway Swisher")
# let's check out the plot
OSBS_Plot2
Now the temperature peaks occur mid-afternoon when we'd expect them.
More on customizing plots here.
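Changing the tzone attribute changes only how a time is displayed, not the moment it refers to. The sketch below illustrates this with a single made-up timestamp. It uses "America/New_York", the tz database name for the same zone as "US/Eastern"; that substitution is our choice for portability, not part of the lesson.

```r
# one illustrative instant, recorded in UTC
t_utc <- as.POSIXct("2014-04-01 19:00:00", tz = "GMT")

# copy it and change only the display time zone
t_local <- t_utc
attributes(t_local)$tzone <- "America/New_York"

# same instant, different clock reading
print(format(t_utc, "%Y-%m-%d %H:%M %Z"))
print(format(t_local, "%Y-%m-%d %H:%M %Z"))
same_instant <- t_utc == t_local
print(same_instant)
```

On this date the Eastern zone is on daylight time, so the local clock is four hours behind UTC rather than five.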
Extracting metadata
Metadata can be stored directly within HDF5 files and attached to each group
or dataset in the file - or to the file itself. To read the metadata for elements
in an HDF5 file in R we use the h5readAttributes function.
To view the groups and datasets in our file, we will grab the nested structure. Going five 'levels' down gets us to the temperature dataset:
# view temp data on "5th" level
fiu_struct <- h5ls(f,recursive=5)
# have a look at the structure.
fiu_struct
## group name otype dclass dim
## 0 / Domain_03 H5I_GROUP
## 1 /Domain_03 OSBS H5I_GROUP
## 2 /Domain_03/OSBS min_1 H5I_GROUP
## 3 /Domain_03/OSBS/min_1 boom_1 H5I_GROUP
## 4 /Domain_03/OSBS/min_1/boom_1 temperature H5I_DATASET COMPOUND 4323
## 5 /Domain_03/OSBS/min_1 boom_2 H5I_GROUP
## 6 /Domain_03/OSBS/min_1/boom_2 temperature H5I_DATASET COMPOUND 4323
## 7 /Domain_03/OSBS/min_1 boom_3 H5I_GROUP
## 8 /Domain_03/OSBS/min_1/boom_3 temperature H5I_DATASET COMPOUND 4323
## 9 /Domain_03/OSBS/min_1 boom_5 H5I_GROUP
## 10 /Domain_03/OSBS/min_1/boom_5 temperature H5I_DATASET COMPOUND 4323
## 11 /Domain_03/OSBS/min_1 tower_top H5I_GROUP
## 12 /Domain_03/OSBS/min_1/tower_top temperature H5I_DATASET COMPOUND 4323
## 13 /Domain_03/OSBS min_30 H5I_GROUP
## 14 /Domain_03/OSBS/min_30 boom_1 H5I_GROUP
## 15 /Domain_03/OSBS/min_30/boom_1 temperature H5I_DATASET COMPOUND 147
## 16 /Domain_03/OSBS/min_30 boom_2 H5I_GROUP
## 17 /Domain_03/OSBS/min_30/boom_2 temperature H5I_DATASET COMPOUND 147
## 18 /Domain_03/OSBS/min_30 boom_3 H5I_GROUP
## 19 /Domain_03/OSBS/min_30/boom_3 temperature H5I_DATASET COMPOUND 147
## 20 /Domain_03/OSBS/min_30 boom_5 H5I_GROUP
## 21 /Domain_03/OSBS/min_30/boom_5 temperature H5I_DATASET COMPOUND 147
## 22 /Domain_03/OSBS/min_30 tower_top H5I_GROUP
## 23 /Domain_03/OSBS/min_30/tower_top temperature H5I_DATASET COMPOUND 147
## 24 / Domain_10 H5I_GROUP
## 25 /Domain_10 STER H5I_GROUP
## 26 /Domain_10/STER min_1 H5I_GROUP
## 27 /Domain_10/STER/min_1 boom_1 H5I_GROUP
## 28 /Domain_10/STER/min_1/boom_1 temperature H5I_DATASET COMPOUND 4323
## 29 /Domain_10/STER/min_1 boom_2 H5I_GROUP
## 30 /Domain_10/STER/min_1/boom_2 temperature H5I_DATASET COMPOUND 4323
## 31 /Domain_10/STER/min_1 boom_3 H5I_GROUP
## 32 /Domain_10/STER/min_1/boom_3 temperature H5I_DATASET COMPOUND 4323
## 33 /Domain_10/STER min_30 H5I_GROUP
## 34 /Domain_10/STER/min_30 boom_1 H5I_GROUP
## 35 /Domain_10/STER/min_30/boom_1 temperature H5I_DATASET COMPOUND 147
## 36 /Domain_10/STER/min_30 boom_2 H5I_GROUP
## 37 /Domain_10/STER/min_30/boom_2 temperature H5I_DATASET COMPOUND 147
## 38 /Domain_10/STER/min_30 boom_3 H5I_GROUP
## 39 /Domain_10/STER/min_30/boom_3 temperature H5I_DATASET COMPOUND 147
# now we can use this object to pull group paths from our file!
fiu_struct[3,1]
## [1] "/Domain_03/OSBS"
## Let's view the metadata for the OSBS group
OSBS <- h5readAttributes(f,fiu_struct[3,1])
# view the attributes
OSBS
## $LatLon
## [1] "29.68927/-81.99343"
##
## $`Site Name`
## [1] "Ordway-Swisher Biological Station Site"
Now we can grab the latitude and longitude for our data from the attributes.
# view lat/long
OSBS$LatLon
## [1] "29.68927/-81.99343"
Note, for continued use we may want to store the coordinates in a different format: here the latitude and longitude, in decimal degrees, are packed into a single string separated by "/", which makes them more difficult to extract in R!
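One way to pull the two numbers out of that string is to split on the "/" and convert the pieces to numeric. A minimal sketch, using the LatLon value shown above (the variable names are ours):

```r
# the attribute value as returned by h5readAttributes
lat_lon <- "29.68927/-81.99343"

# split on "/" and convert the two pieces to numbers
coords <- as.numeric(strsplit(lat_lon, "/")[[1]])
names(coords) <- c("lat", "lon")
print(coords)
```

Storing latitude and longitude as two separate numeric attributes would avoid this parsing step entirely.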
- Explore the H5 file further. View attributes for other groups within the file. If you have HDFView installed, compare what you see in R to what you see in the file visually using HDFView.
- How would you rewrite the metadata for each site to make it more user friendly? Discuss with your neighbor. Map out an H5 file that might be structured in a better way to store your data.
Workflows to Extract and Plot From Multiple Groups
So far we have worked with temperature data collected for three days (a very
small subset of the available data) by one sensor in our NEON HDF5 file. What if
we wanted to create a plot that compared data across sensors or sites? To do this,
we need to compare data stored within different nested groups within our H5 file.
Complex Data
The data come from different sensors located at different levels at one NEON field site, and this multi-nested arrangement can lead to a complex structure. Likely, you'll want to work with data from multiple sensors or levels to address your research questions.
Let's first compare data across temperature sensors located at one site. First, we'll loop through the HDF5 file and build a new data frame that contains temperature data for each boom on the tower. We'll use the 1-minute averaged data from the NEON field site: Ordway Swisher Biological Station located in Florida.
# use dplyr to subset data by dataset name (temperature)
# and site / 1 minute average
newStruct <- fiu_struct %>% filter(grepl("temperature",name),
grepl("OSBS/min_1",group))
#create final paths to access each temperature dataset
paths <- paste(newStruct$group,newStruct$name,sep="/")
#create a new, empty data.frame
OSBS_temp <- data.frame()
The above code uses dplyr to filter data. Let's break the code down.
Read more about the dplyr package here.
- fiu_struct, defined above in the code, is the structure of our HDF5 file that we returned using h5ls.
- grepl looks for a text pattern. Type help(grepl) to learn more. If we want to return all "paths" in the HDF5 file that contain the word temperature in the $name column, then we type grepl("temperature",name).
- filter keeps only the rows that meet every condition we list, which allows us to look for multiple strings in one command. Type help(filter) to learn more.
- %>% is a pipe. It comes from the magrittr package and is made available when dplyr is loaded. It allows you to 'chain' or combine multiple queries together into one concise line of code.
Pulling this together, type fiu_struct %>% filter(grepl("OSBS/min_1",group))
into the R console. What happens?
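To see how grepl and filter combine, here is a small sketch on a hand-made data frame. It assumes dplyr is installed; the four rows mimic the group and name columns of the h5ls output but are typed in by hand, so no HDF5 file is needed.

```r
library(dplyr)

# a hand-made stand-in for part of the h5ls output
toy_struct <- data.frame(
  group = c("/Domain_03/OSBS/min_1",
            "/Domain_03/OSBS/min_1/boom_1",
            "/Domain_03/OSBS/min_30/boom_1",
            "/Domain_10/STER/min_1/boom_1"),
  name  = c("boom_1", "temperature", "temperature", "temperature"),
  stringsAsFactors = FALSE
)

# grepl returns one TRUE/FALSE per row
is_temp <- grepl("temperature", toy_struct$name)
print(is_temp)

# filter keeps the rows where every condition is TRUE
toy_match <- toy_struct %>%
  filter(grepl("temperature", name),
         grepl("OSBS/min_1", group))
print(toy_match)
```

Note that grepl matches substrings, so a pattern such as "min_1" would also match a hypothetical group called "min_10"; a more specific pattern is safer when names share a prefix.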
Next, we will create a loop that will populate the final data.frame that contains
information for all booms in the site that we want to plot.
#loop through each temp dataset and add to data.frame
for(i in paths){
datasetName <- i
print(datasetName)
#read in each dataset in the H5 list
dat <- h5read(f,datasetName)
# add boom name to data.frame
print(strsplit(i,"/")[[1]][5])
dat$boom <- strsplit(i,"/")[[1]][5]
OSBS_temp <- rbind(OSBS_temp,dat)
}
## [1] "/Domain_03/OSBS/min_1/boom_1/temperature"
## [1] "boom_1"
## [1] "/Domain_03/OSBS/min_1/boom_2/temperature"
## [1] "boom_2"
## [1] "/Domain_03/OSBS/min_1/boom_3/temperature"
## [1] "boom_3"
## [1] "/Domain_03/OSBS/min_1/boom_5/temperature"
## [1] "boom_5"
## [1] "/Domain_03/OSBS/min_1/tower_top/temperature"
## [1] "tower_top"
The loop above iterates through the file and grabs the temperature data for each
boom in the 1 minute data series for Ordway. It also adds the boom name to the
end of the data.frame as follows:
- for(i in paths): loop through each path in the paths object. NOTE: the boom 4 sensor was not operational when this HDF5 file was created, which is why there is no boom 4 in our list! Thus we will need to iterate through the data 5 times instead of 6.
- dat <- h5read(f,datasetName): read in the temperature dataset from our HDF5 file (f) for the current path (datasetName is a copy of i).
- dat$boom <- strsplit(i,"/")[[1]][5]: add the boom name to a column called boom in our data.frame.
- OSBS_temp <- rbind(OSBS_temp,dat): append the dataset to the end of the data.frame.
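The index 5 in the strsplit call can look arbitrary. The sketch below shows where it comes from, and also shows an alternative pattern that collects the pieces in a list and binds them once at the end. The read_toy function is a hypothetical stand-in for h5read(f, path) so that the example runs without the HDF5 file.

```r
toy_paths <- c("/Domain_03/OSBS/min_1/boom_1/temperature",
               "/Domain_03/OSBS/min_1/boom_2/temperature")

# the leading "/" produces an empty first element, so the boom name
# lands in position 5 (and the site name in position 3)
pieces <- strsplit(toy_paths[1], "/")[[1]]
print(pieces)

# hypothetical stand-in for h5read(f, path): returns two made-up rows
read_toy <- function(path) {
  data.frame(mean = c(15.1, 15.0))
}

# read each path, tag it with its boom name, then bind everything once
toy_list <- lapply(toy_paths, function(p) {
  dat <- read_toy(p)
  dat$boom <- strsplit(p, "/")[[1]][5]
  dat
})
toy_temp <- do.call(rbind, toy_list)
print(toy_temp)
```

Binding once at the end avoids copying the growing data frame on every iteration, which matters more as the number of datasets increases.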
Modify the loop above so that it adds the boom name, the site name and the data type (1 minute) as columns in our data frame.
Cleaning Up Dates
The date field in our data frame isn't imported by default in "date format". We need to tell R to format the information as a date. Formatting our date fields also allows us to properly label the x axis of our plots.
Once the dates have been formatted we can create a plot with cleaner X axis labels.
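#fix the dates: the timestamps are recorded in UTC, so read them in as UTC first
OSBS_temp$date <- as.POSIXct(OSBS_temp$date,format = "%Y-%m-%d %H:%M:%S", tz = "GMT")
# then display them in local time at OSBS
attributes(OSBS_temp$date)$tzone <- "US/Eastern"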
#plot the data
OSBS_allPlot <-ggplot(OSBS_temp,aes(x=date,y=mean,group=boom,colour=boom))+
geom_path()+
ylab("Mean temperature") + xlab("Date")+
theme_bw()+
ggtitle("3 Days of temperature data at Ordway Swisher")+
scale_x_datetime(breaks=pretty_breaks(n=4))
OSBS_allPlot
Data from different sites
Next, let's compare temperature at two different sites: Ordway Swisher Biological Station (OSBS) located in Florida and North Sterling (STER) located in north central Colorado. This time we'll plot data averaged every 30 minutes instead of every minute. We'll need to modify our search strings a bit. But we can still re-use most of the code that we just built.
# grab just the paths to temperature data, 30 minute average
pathStrux <- fiu_struct %>% filter(grepl("temperature",name),
grepl("min_30",group))
# create final paths
paths <- paste(pathStrux$group,pathStrux$name,sep="/")
# create empty dataframe
temp_30 <- data.frame()
for(i in paths){
#create columns for boom name and site name
boom <- strsplit(i,"/")[[1]][5]
site <- strsplit(i,"/")[[1]][3]
dat <- h5read(f,i)
dat$boom <- boom
dat$site <- site
temp_30 <- rbind(temp_30,dat)
}
# Assign the date field to a "date" format in R
# the timestamps are recorded in UTC, so state the time zone explicitly
temp_30$date <- as.POSIXct(temp_30$date,format = "%Y-%m-%d %H:%M:%S", tz = "GMT")
# generate a mean temperature for every date across booms
temp30_sum <- temp_30 %>% group_by(date,site) %>% summarise(mean = mean(mean))
# Create plot!
compPlot <- ggplot(temp30_sum,aes(x=date,y=mean,group=site,colour=site)) +
geom_path()+ylab("Mean temperature, 30 Minute Average") +
xlab("Date")+
theme_bw()+
ggtitle("Comparison of OSBS (FL) vs STER (CO)") +
scale_x_datetime( breaks=pretty_breaks(n=4))
compPlot
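The group_by and summarise step above collapses the separate boom records into one mean per site and time step. Here is a minimal sketch of that step on made-up numbers, assuming dplyr is installed (two sites, two booms and two time steps, chosen so the averages are easy to check by hand):

```r
library(dplyr)

# made-up 30 minute means: 2 sites x 2 booms x 2 time steps
toy_30 <- data.frame(
  date = as.POSIXct(rep(c("2014-04-01 00:00:00", "2014-04-01 00:30:00"),
                        times = 4), tz = "GMT"),
  site = rep(c("OSBS", "STER"), each = 4),
  boom = rep(rep(c("boom_1", "boom_2"), each = 2), times = 2),
  mean = c(15, 14, 17, 16, 2, 1, 4, 3)
)

# average across booms within each site and time step
toy_sum <- toy_30 %>%
  group_by(date, site) %>%
  summarise(mean = mean(mean))
print(toy_sum)
```

Each row of the result is the average of the two boom values for that site and time step, so eight input rows become four.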