R Script & Challenge Code: NEON data lessons often contain challenges that reinforce
learned skills. If available, the code for challenge solutions is found in the
downloadable R script of the entire lesson, available in the footer of each lesson page.
Additional Resources
Consider reviewing the documentation for the
rhdf5 package.
About HDF5
The HDF5 format can store large, heterogeneous datasets that include metadata. It
also supports efficient data slicing, or extraction of particular subsets of a
dataset, which means that you don't have to read large files into the
computer's memory/RAM in their entirety in order to work with them.
To access HDF5 files in R, we will use the rhdf5 library which is part of
the Bioconductor
suite of R libraries. It might also be useful to install
the
free HDF5 viewer
which will allow you to explore the contents of an HDF5 file using a graphic interface.
First, let's get R set up. As noted above, the rhdf5 package is part of the
Bioconductor suite of R packages; as of May 2020 it was not yet on CRAN, so we
install it through BiocManager.
# Install rhdf5 package (only need to run if not already installed)
#install.packages("BiocManager")
#BiocManager::install("rhdf5")
# Call the R HDF5 Library
library("rhdf5")
# set working directory to ensure R can find the file we wish to import and where
# we want to save our files
wd <- "~/Git/data/" #This will depend on your local environment
setwd(wd)
Now, let's create a basic H5 file with one group and one dataset in it.
# Create hdf5 file
h5createFile("vegData.h5")
## [1] TRUE
# create a group called aNEONSite within the H5 file
h5createGroup("vegData.h5", "aNEONSite")
## [1] TRUE
# view the structure of the h5 we've created
h5ls("vegData.h5")
## group name otype dclass dim
## 0 / aNEONSite H5I_GROUP
Next, let's create some dummy data to add to our H5 file.
# create some sample, numeric data
a <- rnorm(n=40, mean=1, sd=1)
someData <- matrix(a,nrow=20,ncol=2)
Write the sample data to the H5 file.
# add some sample data to the H5 file located in the aNEONSite group
# we'll call the dataset "temperature"
h5write(someData, file = "vegData.h5", name="aNEONSite/temperature")
# let's check out the H5 structure again
h5ls("vegData.h5")
## group name otype dclass dim
## 0 / aNEONSite H5I_GROUP
## 1 /aNEONSite temperature H5I_DATASET FLOAT 20 x 2
View a "dump" of the entire HDF5 file. NOTE: use this command with CAUTION if you
are working with larger datasets!
# we can look at everything too
# but be cautious using this command!
h5dump("vegData.h5")
## $aNEONSite
## $aNEONSite$temperature
## [,1] [,2]
## [1,] 0.33155432 2.4054446
## [2,] 1.14305151 1.3329978
## [3,] -0.57253964 0.5915846
## [4,] 2.82950139 0.4669748
## [5,] 0.47549005 1.5871517
## [6,] -0.04144519 1.9470377
## [7,] 0.63300177 1.9532294
## [8,] -0.08666231 0.6942748
## [9,] -0.90739256 3.7809783
## [10,] 1.84223101 1.3364965
## [11,] 2.04727590 1.8736805
## [12,] 0.33825921 3.4941913
## [13,] 1.80738042 0.5766373
## [14,] 1.26130759 2.2307994
## [15,] 0.52882731 1.6021497
## [16,] 1.59861449 0.8514808
## [17,] 1.42037674 1.0989390
## [18,] -0.65366487 2.5783750
## [19,] 1.74865593 1.6069304
## [20,] -0.38986048 -1.9471878
# Close the file. This is good practice.
H5close()
Add Metadata (attributes)
Let's add some metadata (called attributes in HDF5 land) to our dummy temperature
data. First, open up the file.
# open the file, create a class
fid <- H5Fopen("vegData.h5")
# open up the dataset to add attributes to, as a class
did <- H5Dopen(fid, "aNEONSite/temperature")
# Provide the NAME and the ATTR (what the attribute says) for the attribute.
h5writeAttribute(did, attr="Here is a description of the data",
name="Description")
h5writeAttribute(did, attr="Meters",
name="Units")
Now we can add some attributes to the group.
# let's add some attributes to the group
did2 <- H5Gopen(fid, "aNEONSite/")
h5writeAttribute(did2, attr="San Joaquin Experimental Range",
name="SiteName")
h5writeAttribute(did2, attr="Southern California",
name="Location")
# close the files, groups and the dataset when you're done writing to them!
H5Dclose(did)
H5Gclose(did2)
H5Fclose(fid)
Working with an HDF5 File in R
Now that we've created our H5 file, let's use it! First, let's have a look at
the attributes of the dataset and group in the file.
# look at the attributes of the temperature dataset
h5readAttributes(file = "vegData.h5",
name = "aNEONSite/temperature")
## $Description
## [1] "Here is a description of the data"
##
## $Units
## [1] "Meters"
# look at the attributes of the aNEONSite group
h5readAttributes(file = "vegData.h5",
name = "aNEONSite")
## $Location
## [1] "Southern California"
##
## $SiteName
## [1] "San Joaquin Experimental Range"
# let's grab some data from the H5 file
testSubset <- h5read(file = "vegData.h5",
name = "aNEONSite/temperature")
testSubset2 <- h5read(file = "vegData.h5",
name = "aNEONSite/temperature",
index=list(NULL,1))
H5close()
Once we've extracted data from our H5 file, we can work with it
in R.
# create a quick plot of the data
hist(testSubset2)
### Challenge: Work with HDF5 Files
Time to practice the skills you've learned. Open up the D17_2013_SJER_vegStr.csv
in R.
Create a new HDF5 file called vegStructure.
Add a group in your HDF5 file called SJER.
Add the veg structure data to that folder.
Add some attributes to the SJER group and to the data.
Now, repeat the above with the D17_2013_SOAP_vegStr.csv file.
Name your second group SOAP.
Hint: read.csv() is a good way to read in .csv files. A sketch of one possible solution for the first part follows below.
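This sketch assumes the D17_2013_SJER_vegStr.csv file sits in your working directory and reuses the rhdf5 functions introduced above; the dataset name and attribute text are illustrative choices, not the official challenge solution.
# read in the SJER vegetation structure data
# (adjust the path if the file is not in your working directory)
sjerVeg <- read.csv(paste0(wd, "D17_2013_SJER_vegStr.csv"),
                    stringsAsFactors = FALSE)
# create a new HDF5 file and a group for the SJER site
h5createFile("vegStructure.h5")
h5createGroup("vegStructure.h5", "SJER")
# write the vegetation structure data to the SJER group
h5write(sjerVeg, file = "vegStructure.h5", name = "SJER/vegStructure")
# add attributes to the group and to the dataset
fid <- H5Fopen("vegStructure.h5")
gid <- H5Gopen(fid, "SJER")
h5writeAttribute(gid, attr = "San Joaquin Experimental Range", name = "SiteName")
did <- H5Dopen(fid, "SJER/vegStructure")
h5writeAttribute(did, attr = "2013 vegetation structure data", name = "Description")
# close the dataset, group, and file when done
H5Dclose(did)
H5Gclose(gid)
H5Fclose(fid)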
Several factors contributed to the extreme flooding that occurred in Boulder,
Colorado in 2013. In this data activity, we explore and visualize stream
discharge data collected by the United States Geological Survey (USGS).
The tutorial is part of the Data Activities that can be used
with the
Quantifying The Drivers and Impacts of Natural Disturbance Events Teaching Module.
Learning Objectives
After completing this tutorial, you will be able to:
Publish & share an interactive plot of the data using Plotly.
Things You'll Need To Complete This Lesson
Please be sure you have the most current version of R and, preferably,
RStudio to write your code.
R Libraries to Install:
ggplot2: install.packages("ggplot2")
plotly: install.packages("plotly")
Data to Download
We include directions on how to directly find and access the data from USGS's
National Water Information System Database. However, depending on your
learning objectives you may prefer to use the
provided teaching data subset that can be downloaded from the NEON Data Skills account
on FigShare.
Set Working Directory This lesson assumes that you have set your working
directory to the location of the downloaded and unzipped data subsets.
R Script & Challenge Code: NEON data lessons often contain challenges that
reinforce learned skills. If available, the code for challenge solutions is found
in the downloadable R script of the entire lesson, available in the footer of each lesson page.
Research Question
What were the patterns of stream discharge prior to and during the 2013 flooding
events in Colorado?
About the Data - USGS Stream Discharge Data
The USGS has a distributed network of aquatic sensors located in streams across
the United States. This network monitors a suite of variables that are important
to stream morphology and health. One of the metrics that this sensor network
monitors is Stream Discharge, a metric which quantifies the volume of water
moving down a stream. Discharge is an ideal metric to quantify flow, which
increases significantly during a flood event.
As defined by USGS: Discharge is the volume of water moving down a stream or
river per unit of time, commonly expressed in cubic feet per second or gallons
per day. In general, river discharge is computed by multiplying the area of
water in a channel cross section by the average velocity of the water in that
cross section.
For more information on stream discharge, see the USGS website.
The USGS tracks stream discharge through time at locations across the United
States. Note the pattern observed in the plot above. The peak recorded discharge
value in 2013 was significantly larger than what was observed in other years.
Source: USGS, National Water Information System.
Obtain USGS Stream Gauge Data
This next section explains how to find and locate data through the USGS's
National Water Information System portal.
If you want to use the pre-compiled dataset at the FigShare link above, you can skip this
section and start again at the Work With Stream Gauge Data header.
Step 1: Search for the data
To search for stream gauge data in a particular area, we can use the
interactive map of all USGS stations.
By searching for locations around "Boulder, CO", we can find 3 gauges in the area.
For this lesson, we want data collected by USGS stream gauge 06730200 located on
Boulder Creek at North 75th St. This gauge is one of the few that was able to continuously
collect data throughout the 2013 Boulder floods.
You can directly access the data for this station through the "Access Data" link
on the map icon or by searching for this site on the
National Water Information System portal.
On the
Boulder Creek stream gauge 06730200 page
, we can now see summary information about the types of data available for this
station. We want to select Daily Data and then the following parameters:
Available Parameters = 00060 Discharge (Mean)
Output format = Tab-separated
Begin Date = 1 October 1986
End Date = 31 December 2013
Now click "Go".
Step 2: Save data to .txt
The output is a plain text page that you must copy into a spreadsheet of
choice and save as a .csv. Note, you can also download the teaching dataset
(above) or access the data through an API (see Additional Resources, below).
Work with Stream Gauge Data
R Packages
We will use ggplot2 to efficiently plot our data and plotly to create interactive plots.
# load packages
library(ggplot2) # create efficient, professional plots
library(plotly) # create cool interactive plots
## Set your working directory to ensure R can find the file we wish to import and where we want to save our files. Be sure to move the downloaded files into your working directory!
wd <- "~/data/" # This will depend on your local environment
setwd(wd)
Import USGS Stream Discharge Data Into R
Now that we better understand the data that we are working with, let's import it into R. First, open up the discharge/06730200-discharge_daily_1986-2013.txt file in a text editor.
What do you notice about the structure of the file?
The first 24 lines are descriptive text and not actual data. Also notice that this file is separated by tabs, not commas. We will need to specify the
tab delimiter when we import our data. We will use the read.csv() function to import it into an R object.
When we use read.csv(), we need to define several attributes of the file
including:
The data are tab delimited. We will tell R to use the "\t" separator, which defines a tab-delimited separation.
The first group of 24 lines in the file are not data; we will tell R to skip
those lines when it imports the data using skip=24.
Our data have a header, which is similar to column names in a spreadsheet. We
will tell R to set header=TRUE to ensure the headers are imported as column
names rather than data values.
Finally, we will set stringsAsFactors = FALSE to ensure our text data are imported as character strings rather than factors.
Let's import our data.
(Note: you can use the discharge/06730200-discharge_daily_1986-2013.csv file
and read it in directly using read.csv() function and then skip to the View
Data Structure section).
#import data
discharge <- read.csv(paste0(wd,"disturb-events-co13/discharge/06730200-discharge_daily_1986-2013.txt"),
sep= "\t", skip=24,
header=TRUE,
stringsAsFactors = FALSE)
#view first few lines
head(discharge)
## agency_cd site_no datetime X17663_00060_00003 X17663_00060_00003_cd
## 1 5s 15s 20d 14n 10s
## 2 USGS 06730200 1986-10-01 30 A
## 3 USGS 06730200 1986-10-02 30 A
## 4 USGS 06730200 1986-10-03 30 A
## 5 USGS 06730200 1986-10-04 30 A
## 6 USGS 06730200 1986-10-05 30 A
When we import these data, we can see that the first row of data is a second
header row rather than actual data. We can remove the second row of header
values by selecting all data beginning at row 2 and ending at the
total number of rows in the file and re-assigning it to the variable discharge. The nrow() function will count the total
number of rows in the object.
# nrow: how many rows are in the R object
nrow(discharge)
## [1] 9955
# remove the first line from the data frame (which is a second list of headers)
# the code below selects all rows beginning at row 2 and ending at the total
# number of rows.
discharge <- discharge[2:nrow(discharge),]
Metadata
We now have an R object that includes only rows containing data values. Each
column also has a unique column name. However, the column names may not be
descriptive enough to be useful - what is X17663_00060_00003?
Reopen the discharge/06730200-discharge_daily_1986-2013.txt file in a text editor or browser. The text at
the top provides useful metadata about our data. On rows 10-12, we see that the
values in the 5th column of data are "Discharge, cubic feet per second (Mean)". Rows 14-16 tell us more about the 6th column of data,
the quality flags.
Now that we know what the columns are, let's rename the column that contains the
discharge value as disValue and the quality-flag column as qualCode so they are
more "human readable" as we work with them
in R. The renaming step itself is sketched below.
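A minimal sketch of the renaming, assuming the discharge and quality-flag values landed in the 4th and 5th columns of the data frame (as in the head() output above):
# rename the discharge and quality-flag columns
# (columns 4 and 5 of the imported data frame -- adjust if your columns differ)
names(discharge)[4] <- "disValue"
names(discharge)[5] <- "qualCode"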
#view structure of data
str(discharge)
## 'data.frame': 9954 obs. of 5 variables:
## $ agency_cd: chr "USGS" "USGS" "USGS" "USGS" ...
## $ site_no : chr "06730200" "06730200" "06730200" "06730200" ...
## $ datetime : chr "1986-10-01" "1986-10-02" "1986-10-03" "1986-10-04" ...
## $ disValue : chr "30" "30" "30" "30" ...
## $ qualCode : chr "A" "A" "A" "A" ...
It appears as if the discharge value is a character (chr) class. This is
likely because we had an additional row in our data. Let's convert the discharge
column to a numeric class. In this case, we can reassign that column to the
integer class, given there are no decimal places.
# view class of the disValue column
class(discharge$disValue)
## [1] "character"
# convert column to integer
discharge$disValue <- as.integer(discharge$disValue)
str(discharge)
## 'data.frame': 9954 obs. of 5 variables:
## $ agency_cd: chr "USGS" "USGS" "USGS" "USGS" ...
## $ site_no : chr "06730200" "06730200" "06730200" "06730200" ...
## $ datetime : chr "1986-10-01" "1986-10-02" "1986-10-03" "1986-10-04" ...
## $ disValue : int 30 30 30 30 30 30 30 30 30 31 ...
## $ qualCode : chr "A" "A" "A" "A" ...
Converting Time Stamps
We have converted our discharge data to an integer class. However, the time
stamp field, datetime is still a character class.
To work with and efficiently plot time series data, it is best to convert date
and/or time data to a date/time class. As we have both date and time data, we
will use the class POSIXct.
#view class
class(discharge$datetime)
## [1] "character"
#convert to date/time class - POSIX
discharge$datetime <- as.POSIXct(discharge$datetime, tz ="America/Denver")
#recheck data structure
str(discharge)
## 'data.frame': 9954 obs. of 5 variables:
## $ agency_cd: chr "USGS" "USGS" "USGS" "USGS" ...
## $ site_no : chr "06730200" "06730200" "06730200" "06730200" ...
## $ datetime : POSIXct, format: "1986-10-01" "1986-10-02" "1986-10-03" ...
## $ disValue : int 30 30 30 30 30 30 30 30 30 31 ...
## $ qualCode : chr "A" "A" "A" "A" ...
No Data Values
Next, let's query our data to check whether there are NoData values in
it. The metadata associated with the data doesn't specify what such values would
be; NA and -9999 are common choices.
# check total number of NA values
sum(is.na(discharge$datetime))
## [1] 0
# check for "strange" values that could be an NA indicator
hist(discharge$disValue)
Excellent! The data contains no NoData values.
Plot The Data
Finally, we are ready to plot our data. We will use ggplot from the ggplot2
package to create our plot.
ggplot(discharge, aes(datetime, disValue)) +
geom_point() +
ggtitle("Stream Discharge (CFS) for Boulder Creek") +
xlab("Date") + ylab("Discharge (Cubic Feet per Second)")
Questions:
What patterns do you see in the data?
Why might there be an increase in discharge during a particular time of year?
Plot Data Time Subsets With ggplot
We can plot a subset of our data within ggplot() by specifying the start and
end times (in a limits object) for the x-axis with scale_x_datetime. Let's
plot data for the months directly around the Boulder flood: August 15 2013 -
October 15 2013.
# Define Start and end times for the subset as R objects that are the time class
start.end <- as.POSIXct(c("2013-08-15 00:00:00","2013-10-15 00:00:00"),tz= "America/Denver")
# plot the data - Aug 15-October 15
ggplot(discharge,
aes(datetime,disValue)) +
geom_point() +
scale_x_datetime(limits=start.end) +
xlab("Date") + ylab("Discharge (Cubic Feet per Second)") +
ggtitle("Stream Discharge (CFS) for Boulder Creek\nAugust 15 - October 15, 2013")
## Warning: Removed 9892 rows containing missing values (`geom_point()`).
We get a warning message because we are "ignoring" lots of the data in the
dataset.
Plotly - Interactive (and Online) Plots
We have now successfully created a plot. We can turn that plot into an interactive
plot using Plotly. Plotly
allows you to create interactive plots that can also be shared online. If
you are new to Plotly, view the companion mini-lesson
Interactive Data Vizualization with R and Plotly
to learn how to set up an account and access your username and API key.
Time subsets in plotly
To plot a subset of the total data we have to manually subset the data as the Plotly
package doesn't (yet?) recognize the limits method of subsetting.
Here we create a new R object with entries corresponding to just the dates we want and then plot that data.
# subset out some of the data - Aug 15 - October 15
discharge.aug.oct2013 <- subset(discharge,
datetime >= as.POSIXct('2013-08-15 00:00',
tz = "America/Denver") &
datetime <= as.POSIXct('2013-10-15 23:59',
tz = "America/Denver"))
# plot the data
disPlot.plotly <- ggplot(data=discharge.aug.oct2013,
aes(datetime,disValue)) +
geom_point(size=3) # makes the points larger than default
disPlot.plotly
# add title and labels
disPlot.plotly <- disPlot.plotly +
theme(axis.title.x = element_blank()) +
xlab("Time") + ylab("Stream Discharge (CFS)") +
ggtitle("Stream Discharge - Boulder Creek 2013")
disPlot.plotly
You can now display your interactive plot in R using the following command:
# view plotly plot in R
ggplotly(disPlot.plotly)
If you are satisfied with your plot you can now publish it to your Plotly account, if desired.
# set username
Sys.setenv("plotly_username"="yourUserNameHere")
# set user key
Sys.setenv("plotly_api_key"="yourUserKeyHere")
# publish plotly plot to your plotly online account if you want.
api_create(disPlot.plotly)
Additional Resources
Additional information on USGS streamflow measurements and data:
USGS data can be downloaded via an API using a command line interface. This is
particularly useful if you want to request data from multiple sites or build the
data request into a script.
Read more here about API downloads of USGS data.
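For example, the dataRetrieval R package (not used in this lesson) wraps that API; a minimal sketch requesting the same daily mean discharge record used above might look like this:
# install.packages("dataRetrieval")  # if not already installed
library(dataRetrieval)
# daily mean discharge (parameter code 00060) for Boulder Creek gauge 06730200
discharge.api <- readNWISdv(siteNumbers = "06730200",
                            parameterCd = "00060",
                            startDate = "1986-10-01",
                            endDate = "2013-12-31")
head(discharge.api)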
R Script & Challenge Code: NEON data lessons often contain challenges that
reinforce learned skills. If available, the code for challenge solutions is found
in the downloadable R script of the entire lesson, available in the footer of each lesson page.
Research Question: How do We Measure Changes in Terrain?
Questions
How can LiDAR data be collected?
How might we use LiDAR to help study the 2013 Colorado Floods?
Additional LiDAR Background Materials
This data activity below assumes basic understanding of remote sensing and
associated landscape models and the use of raster data and plotting raster objects
in R. Consider using these other resources if you wish to gain more background
in these areas.
Digital Terrain Models, Digital Surface Models and Canopy Height
Models are three common LiDAR-derived data products. The digital terrain model
allows scientists to study changes in terrain (topography) over time.
How might we use a CHM, DSM or DTM model to better understand what happened
in the 2013 Colorado Flood?
Would you use only one of the models or could you use two or more of them
together?
In this Data Activity, we will be using Digital Terrain Models (DTMs).
More Details on LiDAR
If you are particularly interested in how LiDAR works consider taking a closer
look at how the data are collected and represented by going through this tutorial
on
"Light Detection and Ranging."
Digital Terrain Models
We can use a digital terrain model (DTM) to view the surface of the earth. By
comparing a DTM from before a disturbance event with one from after a disturbance
event, we can get measurements of where the landscape changed.
First, we need to load the necessary R packages to work with raster files and
set our working directory to the location of our data.
Then we can read in two DTMs. The first DTM preDTM3.tif is a terrain model created from data
collected 26-27 June 2013 and the postDTM3.tif is a terrain model made from data collected
on 8 October 2013.
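The package-loading and import code is not shown in this section. Here is a minimal sketch, assuming the terra package (which provides the terrain() and shade() functions used below) and that the two .tif files were saved in a lidar folder inside the working directory; adjust the paths to your own setup.
# load the terra package for working with raster data
# install.packages("terra")  # if not already installed
library(terra)
# set the working directory to the folder containing the downloaded data
wd <- "~/data/"  # this will depend on your local environment
setwd(wd)
# read in the pre- and post-flood digital terrain models
# (the "lidar" folder name is an assumption; use the path where you saved the files)
DTM_pre  <- rast(paste0(wd, "lidar/preDTM3.tif"))
DTM_post <- rast(paste0(wd, "lidar/postDTM3.tif"))
# print the rasters to view their structure: resolution, extent, CRS, and units
DTM_pre
DTM_post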
Among the information we now know about our data from looking at the raster structure
is that the units are in meters for both rasters.
Hillshade layers are models created to add visual depth to maps. A hillshade represents
what the terrain would look like in shadow with the sun at a specific azimuth.
The default azimuth for many hillshades is 315 degrees -- to the NW.
# Creating hillshade for DTM_pre & DTM_post
# In order to generate the hillshade, we need both the slope and the aspect of
# the extent we are working on.
DTM_pre_slope <- terrain(DTM_pre, v="slope", unit="radians")
DTM_pre_aspect <- terrain(DTM_pre, v="aspect", unit="radians")
DTM_pre_hillshade <- shade(DTM_pre_slope, DTM_pre_aspect)
DTM_post_slope <- terrain(DTM_post, v="slope", unit="radians")
DTM_post_aspect <- terrain(DTM_post, v="aspect", unit="radians")
DTM_post_hillshade <- shade(DTM_post_slope, DTM_post_aspect)
Now we can plot the raster objects (DTM & hillshade) together by using add=TRUE
when plotting the second plot. To be able to see the first (hillshade) plot
through the second (DTM) plot, we also set a value between 0 (transparent) and 1
(not transparent) for the alpha= argument.
# plot Pre-flood w/ hillshade
plot(DTM_pre_hillshade,
col=grey(1:90/100), # create a color ramp of grey colors for hillshade
legend=FALSE, # no legend, we don't care about the values of the hillshade
main="Pre-Flood DEM: Four Mile Canyon, Boulder County",
axes=FALSE) # makes for a cleaner plot, if the coordinates aren't necessary
plot(DTM_pre,
axes=FALSE,
alpha=0.3, # sets how transparent the object will be (0=transparent, 1=not transparent)
add=TRUE) # add=TRUE (or T), add plot to the previous plotting frame
Zoom in on the main stream bed. Are changes more visible? Can you tell
where erosion has occurred? Where soil deposition has occurred?
Digital Elevation Model of Difference (DoD)
A Digital Elevation Model of Difference (DoD) is a model of the
change (or difference) between two other digital elevation models - in our case
DTMs.
# DoD: erosion to be neg, deposition to be positive, therefore post - pre
DoD <- DTM_post-DTM_pre
plot(DoD,
main="Digital Elevation Model (DEM) Difference",
axes=FALSE)
Here we have our DoD, but it is a bit hard to read. What does the scale bar tell
us?
Everything in the yellow shades is close to 0 m of elevation change, the areas
toward green show up to a 10 m increase in elevation, and the areas toward red and
white show a decrease in elevation of 5 m or more.
We can see a distribution of the values better by viewing a histogram of all
the values in the DoD raster object.
# histogram of values in DoD
hist(DoD)
## Warning: [hist] a sample of 25% of the cells was used
Most of the areas have a very small elevation change. To make the map easier to
read, we can do two things.
Set breaks where we want the colors to change: The plot of the DoD
above uses a continuous scale to show the gradation between the loss of
elevation and the gain in elevation. While this provides a great deal of
information, in this case, with much of the change around 0 and only a few outlying
values near -5 m or 10 m, a categorical scale could help us visualize the data better.
In the plotting code we can set this with the breaks= argument in the plot()
function. Let's use breaks of -5, -1, -0.5, 0.5, 1, 10, which will give us 5
categories.
Change the color scheme: We can specify a color for each of elevation
categories we just specified with the breaks.
ColorBrewer 2.0 is a great reference for choosing mapping color palettes and
provides the hex codes we need for specifying the colors of the map. Once we've
chosen appropriate colors, we can create a vector of those colors and then use
that vector with the `col=` argument in the `plot()` function to specify these.
Let's now implement these two changes in our code.
# Color palette for 5 categories
difCol5 = c("#d7191c","#fdae61","#ffffbf","#abd9e9","#2c7bb6")
# Alternate palette for 7 categories - try it out!
#difCol7 = c("#d73027","#fc8d59","#fee090","#ffffbf","#e0f3f8","#91bfdb","#4575b4")
# plot hillshade first
plot(DTM_post_hillshade,
col=grey(1:90/100), # create a color ramp of grey colors
legend=FALSE,
main="Elevation Change Post-Flood: Four Mile Canyon, Boulder County",
axes=FALSE)
# add the DoD to it with specified breaks & colors
plot(DoD,
breaks = c(-5,-1,-0.5,0.5,1,10),
col= difCol5,
axes=FALSE,
alpha=0.3,
add =T)
Question
Do you think this is the best color scheme or set point for the breaks? Create
a new map that uses different colors and/or breaks. Does it more clearly show
patterns than this plot?
Optional Extension: Crop to Defined Area
If we want to crop the map to a smaller area, say the mouth of the canyon
where most of the erosion and deposition appears to have occurred, we can crop
by using known geographic locations (in the same CRS as the raster object) or
by manually drawing a box.
Method 1: Manually draw cropbox
# plot the rasters you want to crop from
plot(DTM_post_hillshade,
col=grey(1:90/100), # create a color ramp of grey colors
legend=FALSE,
main="Pre-Flood Elevation: Four Mile Canyon, Boulder County",
axes=FALSE)
plot(DoD,
breaks = c(-5,-1,-0.5,0.5,1,10),
col= difCol5,
axes=FALSE,
alpha=0.3,
add =T)
# crop by designating two opposite corners
cropbox1 <- draw()
After executing the draw() function, click on the plot at the two opposite
corners of the box you want to crop to. You should see a
red-bordered polygon displayed on the raster at this point.
When we call this new object, we can view the new extent.
# view the extent of the cropbox1
cropbox1
## [1] 473814 474982 4434537 4435390
It is a good idea to write this new extent down, so that you can use the extent
again the next time you run the script.
Method 2: Define the cropbox
If you know the desired extent, you can also define the crop box directly
by creating a vector containing the four extent values (x min,
x max, y min, and y max) of the region.
# desired coordinates of the box
cropbox2 <- c(473792.6,474999,4434526,4435453)
Once you have the crop box defined, either by manually clicking or by setting
the coordinates, you can crop the desired layer to the crop box.
# crop desired layers to the cropbox2 extent
DTM_pre_crop <- crop(DTM_pre, cropbox2)
DTM_post_crop <- crop(DTM_post, cropbox2)
DTMpre_hill_crop <- crop(DTM_pre_hillshade,cropbox2)
DTMpost_hill_crop <- crop(DTM_post_hillshade,cropbox2)
DoD_crop <- crop(DoD, cropbox2)
# plot the pre- and post-flood elevation + DEM difference rasters again, using the cropped layers
# PRE-FLOOD (w/ hillshade)
plot(DTMpre_hill_crop,
col=grey(1:90/100), # create a color ramp of grey colors:
legend=FALSE,
main="Pre-Flood Elevation: Four Mile Canyon, Boulder County ",
axes=FALSE)
plot(DTM_pre_crop,
axes=FALSE,
alpha=0.3,
add=T)
# POST-FLOOD (w/ hillshade)
plot(DTMpost_hill_crop,
col=grey(1:90/100), # create a color ramp of grey colors
legend=FALSE,
main="Post-Flood Elevation: Four Mile Canyon, Boulder County",
axes=FALSE)
plot(DTM_post_crop,
axes=FALSE,
alpha=0.3,
add=T)
# ELEVATION CHANGE - DEM Difference
plot(DTMpost_hill_crop,
col=grey(1:90/100), # create a color ramp of grey colors
legend=FALSE,
main="Post-Flood Elevation Change: Four Mile Canyon, Boulder County",
axes=FALSE)
plot(DoD_crop,
breaks = c(-5,-1,-0.5,0.5,1,10),
col= difCol5,
axes=FALSE,
alpha=0.3,
add =T)
Now you have a graphic of your particular area of interest.
The St. Vrain River in Boulder County, CO after (left) and before
(right) the 2013 flooding event. Source: Boulder County via KRCC.
A major disturbance event like this flood causes significant changes in a
landscape. The St. Vrain River in the image above completely shifted its course
of flow in less than 5 days! This brings major changes for aquatic organisms,
like crayfish, that lived along the old stream bed that is now bare and dry, or
for terrestrial organisms, like a field vole, that used to have a burrow under
what is now the St. Vrain River. Likewise, the people living in the house that
is now on the west side of the river instead of the eastern bank have a
completely different yard and driveway!
Why might this storm have caused so much flooding?
What other weather patterns could have contributed to pronounced flooding?
Introduction to Disturbance Events
Definition: In ecology, a disturbance event
is a temporary change in environmental conditions that causes a pronounced
change in the ecosystem. Common disturbance events include floods, fires,
earthquakes, and tsunamis.
Within ecology, disturbance events are those events which cause dramatic change
in an ecosystem through a temporary, often rapid, change in environmental
conditions. Although the disturbance events themselves can be of short duration,
the ecological effects last decades, if not longer.
Common examples of natural ecological disturbances include hurricanes, fires,
floods, earthquakes and windstorms.
Common natural ecological disturbances.
Disturbance events can also be human caused: clear cuts when logging, fires to
clear forests for cattle grazing or the building of new housing developments
are all common disturbances.
Common human-caused ecological disturbances.
Ecological communities are often more resilient to some types of disturbance than
others. Some communities are even dependent on cyclical disturbance events.
Lodgepole pine (Pinus contorta) forests in the Western US are dependent on
frequent stand-replacing fires to release seeds and spur the growth of young
trees.
Regrowth of Lodgepole Pine (Pinus contorta) after a stand-replacing fire.
Source: Jim Peaco, September 1998, Yellowstone Digital Slide Files;
Wikipedia Commons.
However, in discussions of ecological disturbance events we think about events
that disrupt the status of the ecosystem and change the structure of the
landscape.
In this lesson we will dig into the causes and disturbances caused during a storm
in September 2013 along the Colorado Front Range.
Driver: Climatic & Atmospheric Patterns
Drought
How do we measure drought?
Definition: The Palmer Drought Severity
Index is a measure of soil moisture content. It is calculated from soil
available water content, precipitation, and temperature data. The values range
from extreme drought (values < -4.0) through near normal (-0.49 to 0.49)
to extremely moist (> 4.0).
Bonus: There are several other commonly used drought indices. The
National Drought Mitigation Center
provides a comparison of the different indices.
This interactive plot shows the Palmer Drought Severity Index from 1991 through
2015 for Colorado.
Palmer Drought Severity Index for Colorado 1991-2015. Source: National
Ecological Observatory Network (NEON) based on data from
National Climatic Data Center–NOAA.
Questions
Use the figure above to answer these questions:
In this dataset, what years are near normal, extreme drought, and
extreme wet on the Palmer Drought Severity Index?
What are the patterns of drought within Colorado that you observe using this
Palmer Drought Severity Index?
What were the drought conditions immediately before the September 2013
floods?
Over this decade and a half, there have been several cycles of dry and wet
periods. The 2013 flooding occurred right at the end of a severe drought.
There is a connection between the dry soils during a drought and the potential
for flooding. In a severe drought the top layers of the soil dry out. Organic
matter, spongy and absorbent in moist topsoil, may desiccate and be transported
by the winds to other areas. Some soil types, like clay, can dry to a
near-impermeable brick causing water to flow across the top instead of sinking
into the soils.
Atmospheric Conditions
In early September 2013, a slow moving cold front moved through Colorado
intersecting with a warm, humid front. The clash between the cold and warm
fronts yielded heavy rain and devastating flooding across the Front Range in
Colorado.
The storm that caused the 2013 Colorado flooding was kept in a confined area
over the Eastern Range of the Rocky Mountains in Colorado by these water vapor
systems.
Driver: Precipitation
How do we measure precipitation?
Definition: Precipitation is the moisture that
falls from clouds including rain, hail and snow.
Precipitation can be measured by different types of gauges; some must be
manually read and emptied, others automatically record the amount of
precipitation. If the precipitation is in a frozen form (snow, hail, freezing rain)
the contents of the gauge must be melted to get the water equivalency for
measurement. Rainfall is generally reported as the total amount of rain
(millimeters, centimeters, or inches) over a given period of time.
Boulder, Colorado lies on the eastern edge of the Rocky Mountains where they meet
the high plains. The average annual precipitation is near 20". However, the
precipitation comes in many forms -- winter snow, intense summer thunderstorms,
and intermittent storms throughout the year.
The figure below shows the total precipitation each month from 1948 to 2013 for
the National Weather Service's COOP site Boulder 2 (Station ID:050843) that is
centrally located in Boulder, CO.
Notice the general pattern of rainfall across the 65 years.
How much rain generally falls within one month?
Is there a strong annual or seasonal pattern? (Remember, with
interactive Plotly plots you can zoom in on the data)
Do any other events over the last 65 years equal the September 2013 event?
Now that we've looked at 65 years of data from Boulder, CO, let's focus more
specifically on the September 2013 event. The plot below shows daily
precipitation between August 15 - October 15, 2013.
Explore the data and answer the following questions:
On what dates were the highest precipitation values observed?
The USGS has a distributed network of aquatic sensors located in streams across
the United States. This network monitors a suite of variables that are important
to stream morphology and health. One of the metrics that this sensor network
monitors is stream discharge, a metric which quantifies the volume of water
moving down a stream. Discharge is an ideal metric to quantify flow, which
increases significantly during a flood event.
How is stream discharge measured?
Most USGS streamgages operate by measuring the elevation of the water in the
river or stream and then converting the water elevation (called 'stage') to a
streamflow ('discharge') by using a curve that relates the elevation to a set
of actual discharge measurements. This is done because currently the
technology is not available to measure the flow of the water accurately
enough directly. From the
USGS National Streamflow Information Program
A typical USGS stream gauge using a stilling well. Source: USGS.
What was the stream discharge prior to and during the flood events?
The data for the stream gauge along Boulder Creek 5 miles downstream of downtown
Boulder is reported in daily averages. Take a look at the interactive plot
below to see what patterns of discharge are visible in these data.
And it isn't just floods: major hurricanes are forecast to strike New Orleans,
Louisiana once every
20 years.
Yet in 2005 New Orleans was pummeled by 4 hurricanes and 1
tropical storm. Hurricane Cindy in July 2005 caused the worst blackout in New
Orleans in 40 years. Eight weeks later Hurricane Katrina came ashore over New
Orleans, changed the landscape of the city and became the costliest natural
disaster to date in the United States. It was frequently called a 100-year
storm.
If we say the return period is 20 years then how did 4 hurricanes strike New
Orleans in 1 year?
The return period of extreme events is also referred to as the recurrence interval.
It is an estimate, based on the statistical analysis of data (including flood
records, fire frequency, and historical climatic records), of the likelihood that
an event of a given magnitude will occur in a given year. The probability can be
used to assess the risk of these events for human populations but can also be used
by biologists when creating habitat management plans or conservation plans for
endangered species. The concept is based on the magnitude-frequency principle,
where large magnitude events (such as major hurricanes) are comparatively less
frequent than smaller magnitude incidents (such as rain showers). (For more
information visit Climatica's Return Periods of Extreme Events.)
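A quick back-of-the-envelope calculation in R shows what this probability means in practice, assuming each year is independent and a "100-year" event has a 1-in-100 chance of occurring in any given year:
# annual probability of a "100-year" event
p <- 1/100
# probability of at least one such event over a 30-year period (e.g. a mortgage)
1 - (1 - p)^30
# about 0.26, i.e. roughly a 1-in-4 chance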
Question
Your friend is thinking about buying a house near Boulder Creek. The
house is above the level of seasonal high water but was flooded in the 2013
flood. He realizes how expensive flood insurance is and says, "Why do I have to
buy this insurance, a flood like that won't happen for another 100 years?
I won't live here any more." How would you explain to him that even though the
flood was a 100-year flood he should still buy the flood insurance?
Flood Plains
Definition: A flood plain is land adjacent to a waterway, from the
channel banks to the base of the enclosing valley walls, that experiences
flooding during periods of high discharge.
Flood plain through the city of Boulder. The LiDAR data used in
the lesson is from Four Mile Canyon Creek. Source: floodsafety.com.
Impact: Erosion & Sedimentation
How can we evaluate the impact of a flooding event?
We could view photos from before and after the disturbance event to see where
erosion or sedimentation has occurred.
Images are great for an overall impression of what happened, where soil has
eroded, and where soil or rocks have been deposited. But it is hard to
get measurements of change from these 2D images. There are several ways that we can
measure the apparent erosion and soil deposition.
3. Field Surveys
Standard surveys can be done to map the three-dimensional position of points allowing
for measurement of distance between points and elevation. However, this requires
extensive effort by land surveyors to visit each location and collect a large
number of points to precisely map the region. This method can be very time intensive.
Survey of a NEON field site done with a total station Source: Michael Patterson.
This method is challenging to use over a large spatial scale.
4. Stereoscopic Images
We could view stereoscopic images: two photos taken from slightly different
perspectives give the illusion of 3D, so one can view, and even measure,
elevation changes from 2D pictures.
A Sokkisha MS-16 stereoscope and the overlapping images used to
create 3-D visuals from an aerial photo. Source: Brian Haren.
However, this relies on specialized equipment and is challenging to automate.
5. LiDAR
A newer technology is Light Detection and Ranging (LiDAR or lidar).
Watch this video to see how LiDAR works.
Using LiDAR to Measure Change
LiDAR data allows us to create models of the earth's surface. The general term
for a model of the elevation of an area is a Digital Elevation Model (DEM). DEMs
come in two common types:
Digital Terrain Models (DTM): The elevation of the ground (terrain).
Digital Surface Models (DSM): The elevation of everything on the surface of the earth,
including trees, buildings, or other structures. Note: some DSMs have been post-processed to remove buildings and other human-made objects.
Digital Terrain Models and Digital Surface Models are two common
LiDAR-derived data products. The digital terrain model allows scientists to
study changes in terrain (topography) over time.
Digital Terrain Models (DTMs)
Here we have a Digital Terrain Model of lower Four-Mile Canyon Creek from before
the 2013 flood event (left) and from after the 2013 flood event (right). We can
see some subtle differences in the elevation, especially along the stream bed;
however, even on this map it is still challenging to see.
Digital Elevation Model of Difference (DoD)
If we have a DEM from before and after an event, we can create a model that
shows the change that occurred during the event. This new model is called a
Digital Elevation Model of Difference (DoD).
A cross-section showing the data represented by a Digital
Elevation Model of Difference (DoD) created by subtracting one DTM from
another. The resultant DoD shows the change that has occurred in a given
location; here, in orange, the areas where the earth's surface is lower than
before and, in teal, the areas where the earth's surface is higher than
before.
Questions
Here we are using DTMs to create our Digital Elevation Model of Difference (DoD)
to look at the changes after a flooding event. For what types of disturbance events
or research questions might one want to create DoDs from Digital Surface
Models instead?
Four Mile Canyon Creek DoD
Areas in red are those where the elevation is lower after the flood
and areas in blue are those where the elevation is higher after the flood event.
We've used the data from drought, atmospheric conditions, precipitation, stream flow,
and the digital elevation models to help us understand what caused the 2013
Boulder County, Colorado flooding event and where there was change in the stream
bed around Four Mile Canyon Creek at the outskirts of Boulder, Colorado.
Quantifying the change that we can see from images of the flooding streams or
post-flood changes to low lying areas allows scientists, city planners, and
homeowners to make choices about future land use plans.
Follow-up Questions
What other types of questions could this or similar data be used to answer?
What types of disturbance events in your local area could you use data to
quantify the causes and impact of?
Several factors contributed to the extreme flooding that occurred in Boulder,
Colorado in 2013. In this data activity, we explore and visualize precipitation
(rainfall) data collected by the National Weather Service's
Cooperative Observer Program. The tutorial is part of the Data Activities that can be used
with the
Quantifying The Drivers and Impacts of Natural Disturbance Events Teaching Module.
Learning Objectives
After completing this tutorial, you will be able to:
Publish & share an interactive plot of the data using Plotly.
Subset data by date (if completing Additional Resources code).
Set a NoData Value to NA in R (if completing Additional Resources code).
Things You'll Need To Complete This Lesson
Please be sure you have the most current version of R and, preferably,
RStudio to write your code.
R Libraries to Install:
ggplot2: install.packages("ggplot2")
plotly: install.packages("plotly")
Data to Download
Part of this lesson is to access and download the data directly from NOAA's
NCEI Database. If instead you would prefer to download the data as a single compressed file, it can be downloaded from the
NEON Data Skills account on FigShare.
Set Working Directory This lesson assumes that you have set your working
directory to the location of the downloaded and unzipped data subsets.
R Script & Challenge Code: NEON data lessons often contain challenges that reinforce learned skills. If available, the code for challenge solutions is found in the downloadable R script of the entire lesson, available in the footer of each lesson page.
Research Question
What were the patterns of precipitation leading up to the 2013 flooding events in Colorado?
Precipitation Data
The heavy precipitation (rain) that occurred in September 2013 caused much
damage during the 2013 flood by, among other things, increasing
stream discharge (flow). In this lesson we will download, explore, and
visualize the precipitation data collected during this time to better understand
this important flood driver.
Where can we get precipitation data?
The precipitation data are obtained through
NOAA's NCEI database. There are numerous
datasets that can be found and downloaded via the
CDO Search portal.
"Through the National Weather Service (NWS) Cooperative Observer Program
(COOP), more than 10,000 volunteers take daily weather observations at National
Parks, seashores, mountaintops, and farms as well as in urban and suburban
areas. COOP data usually consist of daily maximum and minimum temperatures,
snowfall, and 24-hour precipitation totals."
Quoted from NOAA's National Centers for Environmental Information
Data are collected at different stations, often on paper data sheets like the
one below, and then entered into a central database where we can access that data
and download it in the .csv (Comma Separated Values) format.
An example of the data sheets used to collect the precipitation
data for the Cooperative Observer Network. Source: Cooperative Observer
Network, NOAA
Note: If you are using the pre-compiled data subset that can be downloaded as a
compressed file above, you should still read through this section to know where
the data comes from before proceeding with the lesson.
Step 1: Search for the data
To obtain data we must first choose a location of interest.
The COOP site Boulder 2 (Station ID:050843) is centrally located in Boulder, CO.
Cooperative Observer Network station 050843 is located in
central Boulder, CO. Source: National Centers for Environmental Information
Then we must decide what type of data we want to download for that station. As
shown in the image below, we selected:
the desired date range (1 January 2003 to 31 December 2013),
the type of dataset ("Precipitation Hourly"),
the search type ("Stations") and
the search term (e.g. the # for the station located in central Boulder, CO: 050843).
CDO search page for the central Boulder, CO, station:050843
Once the data are entered and you select Search, you will be directed to a
new page with a map. You can find out more information about the data by selecting
View Full Details.
Notice that this dataset goes all the way back to 1 August 1948! However, we've
selected only a portion of this time frame.
Step 2: Request the data
Once you are sure this is the data that you want, you need to request it by
selecting Add to Cart. The data can then be downloaded as a .csv file
which we will use to conduct our analyses. Be sure to double check the date
range before downloading.
On the options page, we want to make sure we select:
Station Name
Geographic Location (this gives us longitude & latitude; optional)
Include Data Flags (this gives us information if the data are problematic)
Units (Standard)
Precipitation (w/ HPCP automatically checked)
On the next page you must enter an email address for the dataset to be sent
to.
Step 3: Get the data
As this is a small dataset, it won't take long for you to get an email telling
you the dataset is ready. Follow the link in the email to download your dataset.
You can also view documentation (metadata) for the data.
Each data subset is downloaded with a unique order number. The order number in
our example dataset is 805325. If you are using a dataset you've downloaded
yourself, make sure to substitute in your own order number in the code below.
To ensure that we remember what our data file is, we've added a descriptor to
the order number: 805325-precip_daily_2003-2013. You may wish to do the same.
Work with Precipitation Data
R Libraries
We will use ggplot2 to efficiently plot our data and plotly to create interactive plots.
# load packages
library(ggplot2) # create efficient, professional plots
library(plotly) # create cool interactive plots
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggmap':
##
## wind
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
# set your working directory to ensure R can find the file we wish to import
# and where we want to save our files. Be sure to move the download into
# your working directory!
wd <- "~/Git/data/" # This will depend on your local environment
setwd(wd)
Import Precipitation Data
We will use the 805325-precip_daily_2003-2013.csv file
in this analysis. This dataset contains the precipitation data from the COOP
station 050843 in Boulder, CO for 1 January 2003 through 31 December 2013.
Since the data format is a .csv, we can use read.csv to import the data. After
we import the data, we can take a look at the first few lines using head(),
which defaults to the first 6 rows of the data.frame. Finally, we can explore
the R object structure.
# import precip data into R data.frame by
# defining the file name to be opened
precip.boulder <- read.csv(paste0(wd,"disturb-events-co13/precip/805325-precip_daily_2003-2013.csv"), stringsAsFactors = FALSE, header = TRUE )
# view first 6 lines of the data
head(precip.boulder)
## STATION STATION_NAME ELEVATION LATITUDE LONGITUDE
## 1 COOP:050843 BOULDER 2 CO US 1650.5 40.03389 -105.2811
## 2 COOP:050843 BOULDER 2 CO US 1650.5 40.03389 -105.2811
## 3 COOP:050843 BOULDER 2 CO US 1650.5 40.03389 -105.2811
## 4 COOP:050843 BOULDER 2 CO US 1650.5 40.03389 -105.2811
## 5 COOP:050843 BOULDER 2 CO US 1650.5 40.03389 -105.2811
## 6 COOP:050843 BOULDER 2 CO US 1650.5 40.03389 -105.2811
## DATE HPCP Measurement.Flag Quality.Flag
## 1 20030101 01:00 0.0 g
## 2 20030201 01:00 0.0 g
## 3 20030202 19:00 0.2
## 4 20030202 22:00 0.1
## 5 20030203 02:00 0.1
## 6 20030205 02:00 0.1
# view structure of data
str(precip.boulder)
## 'data.frame': 1840 obs. of 9 variables:
## $ STATION : chr "COOP:050843" "COOP:050843" "COOP:050843" "COOP:050843" ...
## $ STATION_NAME : chr "BOULDER 2 CO US" "BOULDER 2 CO US" "BOULDER 2 CO US" "BOULDER 2 CO US" ...
## $ ELEVATION : num 1650 1650 1650 1650 1650 ...
## $ LATITUDE : num 40 40 40 40 40 ...
## $ LONGITUDE : num -105 -105 -105 -105 -105 ...
## $ DATE : chr "20030101 01:00" "20030201 01:00" "20030202 19:00" "20030202 22:00" ...
## $ HPCP : num 0 0 0.2 0.1 0.1 ...
## $ Measurement.Flag: chr "g" "g" " " " " ...
## $ Quality.Flag : chr " " " " " " " " ...
About the Data
Viewing the structure of these data, we can see that different types of data are included in
this file.
STATION and STATION_NAME: Identification of the COOP station.
ELEVATION, LATITUDE and LONGITUDE: The spatial location of the station.
DATE: Gives the date in the format: YYYYMMDD HH:MM. Notice that DATE is
currently class chr, meaning the data are interpreted as a character class and
not as a date.
HPCP: The total precipitation given in inches
(since we selected Standard for the units), recorded
for the hour ending at the time specified by DATE. Importantly, the metadata
(see below) notes that the value 999.99 indicates missing data. Also important,
hours with no precipitation are not recorded.
Measurement Flag: Indicates if there are any abnormalities with the
measurement of the data. Definitions of each flag can be found in Table 2 of the
documentation.
Quality Flag: Indicates if there are any potential quality problems with
the data. Definitions of each flag can be found in Table 3 of the documentation.
Additional information about the data, known as metadata, is available in the
PRECIP_HLY_documentation.pdf file that can be downloaded along with the data.
(Note, as of Sept. 2016, there is a mismatch in the data downloaded and the
documentation. The differences are in the units and missing data value:
inches/999.99 (standard) or millimeters/25399.75 (metric)).
Clean the Data
Before we can start plotting and working with the data we always need to check
several important factors:
data class: is R interpreting the data the way we expect it to? The function
str() is an important tool for this.
NoData Values: We need to know if our data contains a specific value that
means "data are missing" and if this value has been assigned to NA in R.
Convert Date-Time
As we've noted, the date field is in a character class. We can convert it to a date/time
class that will allow R to correctly interpret the data and allow us to easily
plot the data. We can convert it to a date/time class using as.POSIXct().
# convert to date/time and retain as a new field
precip.boulder$DateTime <- as.POSIXct(precip.boulder$DATE,
format="%Y%m%d %H:%M")
# date in the format: YearMonthDay Hour:Minute
# double check structure
str(precip.boulder$DateTime)
## POSIXct[1:1840], format: "2003-01-01 01:00:00" "2003-02-01 01:00:00" ...
We've also learned that missing values, also known as NoData
values, are labelled with the placeholder 999.99. Do we have any NoData values in our data?
# histogram - would allow us to see 999.99 NA values
# or other "weird" values that might be NA if we didn't know the NA value
hist(precip.boulder$HPCP)
Looking at the histogram, it looks like we have mostly low values (which makes sense) but a few values
up near 1000 -- likely 999.99. We can assign these entries to be NA, the value that
R interprets as no data.
# assign NoData values to NA
precip.boulder$HPCP[precip.boulder$HPCP==999.99] <- NA
# check that NA values were added;
# we can do this by finding the sum of how many NA values there are
sum(is.na(precip.boulder))
## [1] 94
There are 94 NA values in our dataset. This is missing data.
Questions:
Do we need to worry about the missing data?
Could they affect our analyses?
This depends on what questions we are asking. Here we are looking at
general patterns in the data across 10 years. Since we have just over 3650
days in our entire dataset, missing 94 probably won't affect the general trends
we are looking at.
Can you think of a research question where we would need to be concerned about
the missing data?
Plot Precipitation Data
Now that we've cleaned up the data, we can view it. To do this we will plot using
ggplot() from the ggplot2 package.
#plot the data
precPlot_hourly <- ggplot(data=precip.boulder, # the data frame
aes(DateTime, HPCP)) + # the variables of interest
geom_bar(stat="identity") + # create a bar graph
xlab("Date") + ylab("Precipitation (Inches)") + # label the x & y axes
ggtitle("Hourly Precipitation - Boulder Station\n 2003-2013") # add a title
precPlot_hourly
## Warning: Removed 94 rows containing missing values (position_stack).
As we can see, plotting the hourly data produces very small bars, and it is difficult
to represent all of the information in one figure. Hint: If you can't see any bars on your
plot you might need to zoom in on it.
Plots and comparisons of daily precipitation would be easier to view.
Plot Daily Precipitation
There are several ways to aggregate the data.
Daily Plots
If you only want to view the data plotted by date you need to create a column
with only dates (no time) and then re-plot.
# convert DATE to a Date class
# (this will strip the time, but that is saved in DateTime)
precip.boulder$DATE <- as.Date(precip.boulder$DateTime, # convert to Date class
format="%Y%m%d %H:%M")
#DATE in the format: YearMonthDay Hour:Minute
# double check conversion
str(precip.boulder$DATE)
## Date[1:1840], format: "2003-01-01" "2003-02-01" "2003-02-03" "2003-02-03" ...
precPlot_daily1 <- ggplot(data=precip.boulder, # the data frame
aes(DATE, HPCP)) + # the variables of interest
geom_bar(stat="identity") + # create a bar graph
xlab("Date") + ylab("Precipitation (Inches)") + # label the x & y axes
ggtitle("Daily Precipitation - Boulder Station\n 2003-2013") # add a title
precPlot_daily1
## Warning: Removed 94 rows containing missing values (position_stack).
R will automatically combine all data from the same day and plot it as one entry.
Daily Plots & Data
If you want to record the combined hourly data for each day, you need to create a new data frame to store the daily data. We can
use the aggregate() function to combine all the hourly data into daily data.
We will use the date class DATE field we created in the previous code for this.
# aggregate the Precipitation (PRECIP) data by DATE
precip.boulder_daily <-aggregate(precip.boulder$HPCP, # data to aggregate
by=list(precip.boulder$DATE), # variable to aggregate by
FUN=sum, # take the sum (total) of the precip
na.rm=TRUE) # if there are NA values, ignore them
# if this is FALSE, any NA value will prevent the day's total from being calculated
# view the results
head(precip.boulder_daily)
## Group.1 x
## 1 2003-01-01 0.0
## 2 2003-02-01 0.0
## 3 2003-02-03 0.4
## 4 2003-02-05 0.2
## 5 2003-02-06 0.1
## 6 2003-02-07 0.1
So we now have daily data, but the column names (Group.1 and x) don't mean anything. We can
give them meaningful names using the names() function. Instead of keeping the original
name HPCP for the precipitation column, let's call it PRECIP, and name the date column DATE.
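A minimal sketch of that renaming, using the default column names (Group.1 and x) shown in the output above, so the plotting code below runs as written:
# rename the columns so they are meaningful
names(precip.boulder_daily)[names(precip.boulder_daily) == "Group.1"] <- "DATE"
names(precip.boulder_daily)[names(precip.boulder_daily) == "x"] <- "PRECIP"
# check the new column names
head(precip.boulder_daily)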
# plot daily data
precPlot_daily <- ggplot(data=precip.boulder_daily, # the data frame
aes(DATE, PRECIP)) + # the variables of interest
geom_bar(stat="identity") + # create a bar graph
xlab("Date") + ylab("Precipitation (Inches)") + # label the x & y axes
ggtitle("Daily Precipitation - Boulder Station\n 2003-2013") # add a title
precPlot_daily
Compare this plot to the plot we created using the first method. Are they the same?
R Tip: This manipulation, or aggregation, of data
can also be done with the package plyr using the summarize() function.
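For example, a minimal plyr sketch (assuming the plyr package is installed; column names match those used above) might look like this:
# aggregate hourly data to daily totals with plyr
library(plyr)
precip.boulder_daily2 <- ddply(precip.boulder,   # data frame
                               .(DATE),          # variable to group by
                               summarize,        # summarize each day
                               PRECIP = sum(HPCP, na.rm = TRUE)) # daily total precip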
Subset the Data
Instead of looking at the data for the full decade, let's now focus on just the
2 months surrounding the flood on 11-15 September. We'll focus on the window from 15
August to 15 October.
Just like aggregating, we can accomplish this either by setting limits within the plot itself or
by creating a subset of the data and plotting it separately.
Subset Within Plot
To see only a subset of the larger plot, we can simply set limits for the
scale on the x-axis with scale_x_date().
# First, define the limits -- 2 months around the floods
limits <- as.Date(c("2013-08-15", "2013-10-15"))
# Second, plot the data - Flood Time Period
precPlot_flood <- ggplot(data=precip.boulder_daily,
aes(DATE, PRECIP)) +
geom_bar(stat="identity") +
scale_x_date(limits=limits) +
xlab("Date") + ylab("Precipitation (Inches)") +
ggtitle("Precipitation - Boulder Station\n August 15 - October 15, 2013")
precPlot_flood
## Warning: Removed 770 rows containing missing values (position_stack).
Now we can easily see the dramatic rainfall event in mid-September!
R Tip: If you are using a date-time class, instead
of just a date class, you need to use scale_x_datetime().
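For example, a hedged sketch of the same subsetting idea applied to the hourly (POSIXct) data; the object names and exact times here are illustrative:
# define date-time limits around the flood (times are illustrative)
limits_dt <- as.POSIXct(c("2013-08-15 00:00", "2013-10-15 00:00"))
# plot the hourly data, limiting the x-axis with scale_x_datetime()
precPlot_flood_hourly <- ggplot(data = precip.boulder, aes(DateTime, HPCP)) +
  geom_bar(stat = "identity") +
  scale_x_datetime(limits = limits_dt) +
  xlab("Date") + ylab("Precipitation (Inches)")
precPlot_flood_hourly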
Subset The Data
Now let's create a subset of the data and plot it.
# subset 2 months around flood
precip.boulder_AugOct <- subset(precip.boulder_daily,
DATE >= as.Date('2013-08-15') &
DATE <= as.Date('2013-10-15'))
# check the first & last dates
min(precip.boulder_AugOct$DATE)
## [1] "2013-08-21"
max(precip.boulder_AugOct$DATE)
## [1] "2013-10-11"
# create new plot
precPlot_flood2 <- ggplot(data=precip.boulder_AugOct, aes(DATE,PRECIP)) +
geom_bar(stat="identity") +
xlab("Time") + ylab("Precipitation (inches)") +
ggtitle("Daily Total Precipitation \n Boulder Creek 2013")
precPlot_flood2
Interactive Plots - Plotly
Let's turn our plot into an interactive Plotly plot.
# setup your plot.ly credentials; if not already set up
#Sys.setenv("plotly_username"="your.user.name.here")
#Sys.setenv("plotly_api_key"="your.key.here")
# view plotly plot in R
ggplotly(precPlot_flood2)
# publish plotly plot to your plot.ly online account when you are happy with it
api_create(precPlot_flood2)
Challenge: Plot Precip for Boulder Station Since 1948
The Boulder precipitation station has been recording data since 1948. Use the
steps above to create a plot of the full record of precipitation at this station (1948 - 2013).
The full dataset takes considerable time to download, so we recommend you use the dataset provided in the compressed file ("805333-precip_daily_1948-2013.csv").
As an added challenge, aggregate the data by month instead of by day.
Additional Resources
Units & Scale
If you are using a dataset downloaded before 2016, the units were in
hundredths of an inch, which isn't the most useful measure. You might want to
create a new column PRECIP that contains the data from HPCP converted to
inches.
# convert from 100th inch by dividing by 100
precip.boulder$PRECIP<-precip.boulder$HPCP/100
# view & check to make sure conversion occurred
head(precip.boulder)
## STATION STATION_NAME ELEVATION LATITUDE LONGITUDE DATE
## 1 COOP:050843 BOULDER 2 CO US 1650.5 40.03389 -105.2811 2003-01-01
## 2 COOP:050843 BOULDER 2 CO US 1650.5 40.03389 -105.2811 2003-02-01
## 3 COOP:050843 BOULDER 2 CO US 1650.5 40.03389 -105.2811 2003-02-03
## 4 COOP:050843 BOULDER 2 CO US 1650.5 40.03389 -105.2811 2003-02-03
## 5 COOP:050843 BOULDER 2 CO US 1650.5 40.03389 -105.2811 2003-02-03
## 6 COOP:050843 BOULDER 2 CO US 1650.5 40.03389 -105.2811 2003-02-05
## HPCP Measurement.Flag Quality.Flag DateTime PRECIP
## 1 0.0 g 2003-01-01 01:00:00 0.000
## 2 0.0 g 2003-02-01 01:00:00 0.000
## 3 0.2 2003-02-02 19:00:00 0.002
## 4 0.1 2003-02-02 22:00:00 0.001
## 5 0.1 2003-02-03 02:00:00 0.001
## 6 0.1 2003-02-05 02:00:00 0.001
Question
Compare HPCP and PRECIP. Did we do the conversion correctly?
This data download contains several files. You will only need the RGB .tif files
for this tutorial. The path to these files is: NEON-DS-Field-Site-Spatial-Data/SJER/RGB/* .
The other data files in the downloaded data directory are used for related tutorials.
You should set your working directory to the parent directory of the downloaded
data to follow the code exactly.
Recommended Reading
You may benefit from reviewing these related resources prior to this tutorial:
Raster or "gridded" data are data that are saved in pixels. In the spatial world,
each pixel represents an area on the Earth's surface. A color image raster is
a bit different from other rasters in that it has multiple bands. Each band
represents reflectance values for a particular color or portion of the light spectrum. If the
image is RGB, then the bands are in the red, green and blue portions of the
electromagnetic spectrum. These colors together create what we know as a full
color image.
A color image at the NEON San Joaquin Experimental Range (SJER)
field site in California. Each pixel in the image represents the combined
reflectance in the red, green and blue portions of the electromagnetic spectrum.
Source: National Ecological Observatory Network (NEON)
Work with Multiple Rasters
In
a previous tutorial,
we loaded a single raster into R. We made sure we knew the CRS
(coordinate reference system) and extent of the dataset among other key metadata
attributes. This raster was a Digital Elevation Model so there was only a single
raster that represented the ground elevation in each pixel. When we work with
color images, there are multiple rasters to represent each band. Here we'll learn
to work with multiple rasters together.
Raster Stacks
A raster stack is a collection of raster layers. Each raster layer in the raster
stack needs to have the same
projection (CRS),
spatial extent and
resolution.
You might use raster stacks for different reasons. For instance, you might want to
group a time series of rasters representing precipitation or temperature into
one R object. Or, you might want to create a color image from rasters derived
from the red, green and blue bands.
In this tutorial, we will stack three bands from a multi-band image together to
create a composite RGB image.
First, let's load the R packages that we need: raster, sp, and rgdal.
# load the raster, sp, and rgdal packages
library(raster)
library(sp)
library(rgdal)
# set the working directory to the data
#setwd("pathToDirHere")
wd <- ("~/Git/data/")
setwd(wd)
Next, let's create a raster stack with bands representing
blue: band 19, 473.8nm
green: band 34, 548.9nm
red: band 58, 669.1nm
This can be done by individually assigning each file path as an object.
# define the file paths to the tiffs
band19 <- paste0(wd, "NEON-DS-Field-Site-Spatial-Data/SJER/RGB/band19.tif")
band34 <- paste0(wd, "NEON-DS-Field-Site-Spatial-Data/SJER/RGB/band34.tif")
band58 <- paste0(wd, "NEON-DS-Field-Site-Spatial-Data/SJER/RGB/band58.tif")
# View the paths to check that they point to the right files:
band19
## [1] "~/Git/data/NEON-DS-Field-Site-Spatial-Data/SJER/RGB/band19.tif"
band34
## [1] "~/Git/data/NEON-DS-Field-Site-Spatial-Data/SJER/RGB/band34.tif"
band58
## [1] "~/Git/data/NEON-DS-Field-Site-Spatial-Data/SJER/RGB/band58.tif"
Note that if we wanted to create a stack from all the files in a directory (folder),
we could easily do this with the list.files() function. We would use
full.names=TRUE to ensure that R stores the full directory path for each file in our
list of rasters.
# create list of files to make raster stack
# note: full.names=TRUE is an argument to list.files(), not to paste0()
rasterlist1 <- list.files(paste0(wd,"NEON-DS-Field-Site-Spatial-Data/SJER/RGB"), full.names=TRUE)
rasterlist1
Or, if your directory consists of some .tif files and other file types you
don't want in your stack, you can ask R to only list those files with a .tif
extension.
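A brief sketch of that, using the pattern argument of list.files() (the object name rasterlist2 is just illustrative):
# list only files ending in .tif
rasterlist2 <- list.files(paste0(wd, "NEON-DS-Field-Site-Spatial-Data/SJER/RGB"),
                          full.names = TRUE,   # keep the full path
                          pattern = "\\.tif$") # only .tif files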
Back to creating our raster stack with three bands. We only want three of the
bands in the RGB directory and not the fourth, band90, so we will create the stack
from the bands we loaded individually. We do this with the stack() function.
# create raster stack
rgbRaster <- stack(band19,band34,band58)
# example syntax for stack from a list
#rstack1 <- stack(rasterlist1)
This has now created a stack that is three rasters thick. Let's view them.
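One way to do this, as a minimal sketch, is to print the stack object (which shows its attributes) and then plot each layer:
# view the attributes of the stack
rgbRaster
# plot each band in the stack
plot(rgbRaster)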
From the attributes we see the CRS, resolution, and extent of all three rasters.
Then we can see each raster plotted. Notice the different shading between the
different bands. This is because the landscape reflects in the red, green, and
blue spectra differently.
Check out the scale bars. What do they represent?
These reflectance data are radiances corrected for atmospheric effects. The data
are typically unitless and range from 0-1. NEON Airborne Observation Platform
data, where these rasters come from, have a scale factor of 10,000.
Plot an RGB Image
You can plot a composite RGB image from a raster stack. You need to specify the
order of the bands when you do this. In our raster stack, band 19, which is the
blue band, is first in the stack, whereas band 58, which is the red band, is last.
Thus the order for a RGB image is 3,2,1 to ensure the red band is rendered first
as red.
Thinking ahead to next time: If you know you want to create composite RGB images,
think about the order of your rasters when you make the stack so the RGB=1,2,3.
We will plot the raster stack with the plotRGB() function, which needs the
following arguments:
R object to plot
which layer of the stack is which color
stretch: allows for increased contrast. Options are "lin" & "hist".
Let's try it.
# plot an RGB version of the stack
plotRGB(rgbRaster,r=3,g=2,b=1, stretch = "lin")
Note: read the raster package documentation for other arguments that can be
added (like scale) to improve or modify the image.
Explore Raster Values - Histograms
You can also explore the data. Histograms allow us to view the distribution of
values in the pixels.
# view histogram of reflectance values for all rasters
hist(rgbRaster)
## Warning in .hist1(raster(x, y[i]), maxpixels = maxpixels, main =
## main[y[i]], : 42% of the raster cells were used. 100000 values used.
## Warning in .hist1(raster(x, y[i]), maxpixels = maxpixels, main =
## main[y[i]], : 42% of the raster cells were used. 100000 values used.
## Warning in .hist1(raster(x, y[i]), maxpixels = maxpixels, main =
## main[y[i]], : 42% of the raster cells were used. 100000 values used.
Note about the warning messages: R defaults to only showing the first 100,000
values in the histogram, so if you have a large raster you may not be seeing all
the values. This saves you from long waits, or from crashing R, if you have large
datasets.
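If you do want the histogram to sample more (or all) of the cells, the maxpixels argument mentioned in the warning can be raised; a sketch:
# sample every cell in the raster (can be slow for large rasters)
hist(rgbRaster, maxpixels = ncell(rgbRaster))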
Crop Rasters
You can crop all rasters within a raster stack the same way you'd do it with a
single raster.
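A hedged sketch of cropping the whole stack at once with crop(); the extent coordinates below are placeholders and should be replaced with your own area of interest (in the raster's CRS):
# define a cropping extent: xmin, xmax, ymin, ymax (example values only)
cropExtent <- extent(257000, 257200, 4112000, 4112200)
# crop every layer in the stack at the same time
rgbRaster_crop <- crop(rgbRaster, cropExtent)
# view the cropped layers
plot(rgbRaster_crop)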
### Challenge: Plot Cropped RGB
Plot this new cropped stack as an RGB image.
Raster Bricks in R
In our rgbRaster object we have a list of rasters in a stack. These rasters
are all the same extent, CRS and resolution. By creating a raster brick we
will create one raster object that contains all of the rasters so that we can
use this object to quickly create RGB images. Raster bricks are more efficient
objects to use when processing larger datasets. This is because the computer
doesn't have to spend energy finding the data - it is contained within the object.
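A minimal sketch of creating and plotting a brick from our stack (object.size() is just one quick way to compare the two objects):
# create a raster brick from the stack
rgbBrick <- brick(rgbRaster)
# compare the size of the stack and the brick
object.size(rgbRaster)
object.size(rgbBrick)
# plot the brick as an RGB image (same band order as the stack)
plotRGB(rgbBrick, r = 3, g = 2, b = 1, stretch = "lin")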
Notice the faster plotting? For a smaller raster like this the difference is
slight, but for larger rasters it can be considerable.
Write to GeoTIFF
We can write out the raster in GeoTIFF format as well. When we do this it will
copy the CRS, extent and resolution information so the data will read properly
into a GIS program as well. Note that this writes the rasters in the order they
appear in the stack. In our case, the blue band (band 19) is first, but most
programs expect the red band first (RGB).
One way around this is to generate a new raster stack with the rasters in the
proper order - red, green and blue format. Or, just always create your stacks
R->G->B to start!!!
# Make a new stack in the order we want the data in
orderRGBstack <- stack(rgbRaster$band58,rgbRaster$band34,rgbRaster$band19)
# write the geotiff
# change overwrite=TRUE to FALSE if you want to make sure you don't overwrite your files!
writeRaster(orderRGBstack,paste0(wd,"NEON-DS-Field-Site-Spatial-Data/SJER/RGB/rgbRaster.tif"),"GTiff", overwrite=TRUE)
Import A Multi-Band Image into R
You can import a multi-band image into R too. To do this, you import the file as
a stack rather than a raster (which brings in just one band). Let's import the
raster that we just created above.
# import multi-band raster as stack
multiRasterS <- stack(paste0(wd,"NEON-DS-Field-Site-Spatial-Data/SJER/RGB/rgbRaster.tif"))
# import multi-band raster direct to brick
multiRasterB <- brick(paste0(wd,"NEON-DS-Field-Site-Spatial-Data/SJER/RGB/rgbRaster.tif"))
# view raster
plot(multiRasterB)
You will need a free online Plotly account to post & share your plots online. But
you can create the plots and use them on your local computer without an account.
If you do not wish to share plots online you can skip to
Step 3: Create Plotly plot.
Note: Plotly doesn't just work with R -- other programs include Python, MATLAB,
Excel, and JavaScript.
Step 1: Create account
If you do not already have an account, you need to set up an account by visiting
the Plotly website and following
the directions there.
Step 2: Connect account to R
To share plots from R (or RStudio) to Plotly, you have to connect to your
account. This is done through an API (Application Program Interface). You can
find your username & API key in your profile settings on the Plotly website
under the "API key" menu option.
To link your account to your R, use the following commands, substituting in your
own username & key as appropriate.
# set plotly user name
Sys.setenv("plotly_username"="YOUR_USERNAME")
# set plotly API key
Sys.setenv("plotly_api_key"="YOUR_KEY")
Step 3: Create Plotly plot
There are lots of ways to plot with the plotly package. We briefly describe two
basic functions, plot_ly() and ggplotly(). For more information on plotting in
R with Plotly, check out the
Plotly R library page.
Here we use the example dataframe economics that comes with the package.
# load packages
library(ggplot2) # to create plots and feed to ggplotly()
library(plotly) # to create interactive plots
# view str of example dataset
str(economics)
## tibble [574 × 6] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ date : Date[1:574], format: "1967-07-01" "1967-08-01" ...
## $ pce : num [1:574] 507 510 516 512 517 ...
## $ pop : num [1:574] 198712 198911 199113 199311 199498 ...
## $ psavert : num [1:574] 12.6 12.6 11.9 12.9 12.8 11.8 11.7 12.3 11.7 12.3 ...
## $ uempmed : num [1:574] 4.5 4.7 4.6 4.9 4.7 4.8 5.1 4.5 4.1 4.6 ...
## $ unemploy: num [1:574] 2944 2945 2958 3143 3066 ...
# plot with the plot_ly function
unempPerCapita <- plot_ly(x =economics$date, y = economics$unemploy/economics$pop)
To make your plotly plot in R, run the following line:
unempPerCapita
Note: This plot is interactive within the R environment, but appears as a static
image when posted on this website.
If you already use ggplot to create your plots, you can directly turn your
ggplot objects into interactive plots with ggplotly().
## plot with ggplot, then ggplotly
unemployment <- ggplot(economics, aes(date,unemploy)) + geom_line()
unemployment
To make your plotly plot in R, run the following line:
ggplotly(unemployment)
Note: This plot is interactive within the R environment, but appears as a static
image when posted on this website.
Step 4: Publish to Plotly
The function api_create() allows you to post any plotly plot to your account.
# publish plotly plot to your plotly online account
api_create(unemployment)
Examples
The plots below were generated using R code that harnesses the power of the
ggplot2 and the plotly packages. The plotly code utilizes the
RopenSci plotly packages - check them out!
Data Tip Are you a Python user? Use matplotlib
to create and publish visualizations.
After completing this tutorial, you will be able to:
Explain what the Hierarchical Data Format (HDF5) is.
Describe the key benefits of the HDF5 format, particularly related to big data.
Describe both the types of data that can be stored in HDF5 and how it can be stored/structured.
About Hierarchical Data Formats - HDF5
The Hierarchical Data Format version 5 (HDF5), is an open source file format
that supports large, complex, heterogeneous data. HDF5 uses a "file directory"
like structure that allows you to organize data within the file in many different
structured ways, as you might do with files on your computer. The HDF5 format
also allows for embedding of metadata making it self-describing.
**Data Tip:** HDF5 is one hierarchical data format
that builds upon both HDF4 and NetCDF (two other hierarchical data formats).
Read more about HDF5 here.
Hierarchical Structure - A file directory within a file
The HDF5 format can be thought of as a file system contained and described
within one single file. Think about the files and folders stored on your computer.
You might have a data directory with some temperature data for multiple field
sites. These temperature data are collected every minute and summarized on an
hourly, daily and weekly basis. Within one HDF5 file, you can store a similar
set of data organized in the same way that you might organize files and folders
on your computer. However, in an HDF5 file, what we call "directories" or "folders"
on our computers are called groups, and what we call files on our
computers are called datasets.
2 Important HDF5 Terms
Group: A folder like element within an HDF5 file that might contain other
groups OR datasets within it.
Dataset: The actual data contained within the HDF5 file. Datasets are often
(but don't have to be) stored within groups in the file.
An example HDF5 file structure which contains groups, datasets and associated metadata.
An HDF5 file containing datasets, might be structured like this:
An example HDF5 file structure containing data for multiple field sites and also containing various datasets (averaged at different time intervals).
HDF5 is a Self Describing Format
HDF5 format is self describing. This means that each file, group and dataset
can have associated metadata that describes exactly what the data are. Following
the example above, we can embed information about each site in the file, such as:
The full name and X,Y location of the site
Description of the site.
Any documentation of interest.
Similarly, we might add information about how the data in the dataset were
collected, such as descriptions of the sensor used to collect the temperature
data. We can also attach information, to each dataset within the site group,
about how the averaging was performed and over what time period data are available.
One key benefit of having metadata that are attached to each file, group and
dataset, is that this facilitates automation without the need for a separate
(and additional) metadata document. Using a programming language, like R or
Python, we can grab information from the metadata that are already associated
with the dataset, and which we might need to process the dataset.
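For instance, with the rhdf5 package you can read the attributes attached to a dataset directly; a sketch with hypothetical file and dataset names:
# read the metadata (attributes) attached to a dataset
library(rhdf5)
# "siteData.h5" and "siteA/temperature" are placeholder names
h5readAttributes(file = "siteData.h5", name = "siteA/temperature")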
HDF5 files are self describing - this means that all elements
(the file itself, groups and datasets) can have associated metadata that
describes the information contained within the element.
Compressed & Efficient subsetting
The HDF5 format is a compressed format. The size of all data contained within
HDF5 is optimized which makes the overall file size smaller. Even when
compressed, however, HDF5 files often contain big data and can thus still be
quite large. A powerful attribute of HDF5 is data slicing, by which
particular subsets of a dataset can be extracted for processing. This means that
the entire dataset doesn't have to be read into memory (RAM), which is very
helpful when working with very large (gigabytes or more) datasets!
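With rhdf5, for example, the index argument of h5read() lets you pull just a slice of a dataset; a sketch with hypothetical names and dimensions:
library(rhdf5)
# read only rows 1-10 of the first column of a 2-D dataset;
# the rest of the file is never loaded into memory
tempSlice <- h5read(file = "siteData.h5",
                    name = "siteA/temperature",
                    index = list(1:10, 1))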
Heterogeneous Data Storage
HDF5 files can store many different types of data within the same file. For
example, one group may contain a set of datasets holding integer (numeric)
and text (string) data. Or, one dataset can contain heterogeneous data types
(e.g., both text and numeric data in one dataset). This means that HDF5 can store
any of the following (and more) in one file:
Temperature, precipitation and PAR (photosynthetic active radiation) data for
a site or for many sites
A set of images that cover one or more areas (each image can have specific
spatial information associated with it - all in the same file)
A multi or hyperspectral spatial dataset that contains hundreds of bands.
Field data for several sites characterizing insects, mammals, vegetation and
meteorology.
And much more!
Open Format
The HDF5 format is open and free to use. The supporting libraries (and a free
viewer), can be downloaded from the
HDF Group
website. As such, HDF5 is widely supported in a host of programs, including
open source programming languages like R and Python, and commercial
programming tools like Matlab and IDL. Spatial data that are stored in HDF5
format can be used in GIS and imaging programs including QGIS, ArcGIS, and
ENVI.
Summary Points - Benefits of HDF5
Self-Describing: The datasets within an HDF5 file are self describing. This
allows us to efficiently extract metadata without needing an additional metadata
document.
Supports Heterogeneous Data: Different types of datasets can be contained
within one HDF5 file.
Supports Large, Complex Data: HDF5 is a compressed format that is designed
to support large, heterogeneous, and complex datasets.
Supports Data Slicing: "Data slicing", or extracting portions of the
dataset as needed for analysis, means large files don't need to be completely
read into the computer's memory or RAM.
Open Format - wide support in many tools: Because the HDF5 format is
open, it is supported by a host of programming languages and tools, including
open source languages like R and Python and open GIS tools like QGIS.