Data Activity: Visualize Precipitation Data in R to Better Understand the 2013 Colorado Floods
Several factors contributed to extreme flooding that occurred in Boulder, Colorado in 2013. In this data activity, we explore and visualize the data for precipitation (rainfall) data collected by the National Weather Service's Cooperative Observer Program. The tutorial is part of the Data Activities that can be used with the Quantifying The Drivers and Impacts of Natural Disturbance Events Teaching Module.
After completing this tutorial, you will be able to:
- Download precipitation data from NOAA's National Centers for Environmental Information.
- Plot precipitation data in R.
- Publish & share an interactive plot of the data using Plotly.
- Subset data by date (if completing Additional Resources code).
- Set a NoData Value to NA in R (if completing Additional Resources code).
Things You'll Need To Complete This Lesson
Please be sure you have the most current version of R and, preferably, RStudio to write your code.
R Libraries to Install:
Data Download & Directory Preparation
Part of this lesson is to access and download the data directly from NOAA's National Climate Divisional Database. If instead you would prefer to download the data as a single compressed file, it can be downloaded from the NEON Data Skills account on FigShare.
To more easily follow along with this lesson, use the same organization for your
files and folders as we did. First, create a
data directory (folder) within
Documents directory. If you downloaded the compressed data file above,
unzip this file and place the
distub-events-co13 folder within the
directory you created. If you are planning to access the data directly as
described in the lesson, create a new directory called
data folder and then within it create another directory called
precip. If you choose to save your files elsewhere in your file structure, you
will need to modify the directions in the lesson to set your working directory
What were the patterns of precipitation leading up to the 2013 flooding events in Colorado?
The heavy precipitation (rain) that occurred in September 2013 caused much damage during the 2013 flood by, among other things, increasing stream discharge (flow). In this lesson we will download, explore, and visualize the precipitation data collected during this time to better understand this important flood driver.
Where can we get precipitation data?
The precipitation data are obtained through NOAA's National Centers for Environmental Information (formerly the National Climatic Data Center). There are numerous climatic datasets that can be found and downloaded via the Climate Data Online Search portal.
The precipitation data that we will use is from the Cooperative Observer Network (COOP).
"Through the National Weather Service (NWS) Cooperative Observer Program (COOP), more than 10,000 volunteers take daily weather observations at National Parks, seashores, mountaintops, and farms as well as in urban and suburban areas. COOP data usually consist of daily maximum and minimum temperatures, snowfall, and 24-hour precipitation totals." Quoted from NOAA's National Centers for Environmental Information
Data are collected at different stations, often on paper data sheets like the one below, and then entered into a central database where we can access that data and download it in the .csv (Comma Separated Values) format.
Obtain the Data
If you have not already opened the Climate Data Online Search portal, do so now.
Note: If you are using the pre-compiled data subset that can be downloaded as a compressed file above, you should still read through this section to know where the data comes from before proceeding with the lesson.
Step 1: Search for the data
To obtain data we must first choose a location of interest. The COOP site Boulder 2 (Station ID:050843) is centrally located in Boulder, CO.
Then we must decide what type of data we want to download for that station. As shown in the image below, we selected:
- the desired date range (1 January 2003 to 31 December 2013),
- the type of dataset ("Precipitation Hourly"),
- the search type ("Stations") and
- the search term (e.g. the # for the station located in central Boulder, CO: 050843).
Once the data are entered and you select
Search, you will be directed to a
new page with a map. You can find out more information about the data by selecting
View Full Details.
Notice that this dataset goes all the way back to 1 August 1948! However, we've
selected only a portion of this time frame.
Step 2: Request the data
Once you are sure this is the data that you want, you need to request it by
Add to Cart. The data can then be downloaded as a .csv file
which we will use to conduct our analyses. Be sure to double check the date
range before downloading.
On the options page, we want to make sure we select:
- Station Name
- Geographic Location (this gives us longitude & latitude; optional)
- Include Data Flags (this gives us information if the data are problematic)
- Units (Standard)
- Precipitation (w/ HPCP automatically checked)
On the next page you must enter an email address for the dataset to be sent to.
Step 3: Get the data
As this is a small dataset, it won't take long for you to get an email telling
you the dataset is ready. Follow the link in the email to download your dataset.
You can also view documentation (metadata) for the data.
Each data subset is downloaded with a unique order number. The order number in our example dataset is 805325. If you are using a dataset you've downloaded yourself, make sure to substitute in your own order number in the code below.
To ensure that we remember what our data file is, we've added a descriptor to
the order number:
805325-precip_daily_2003-2013. You may wish to do the same.
Work with Precipitation Data
We will use
ggplot2 to efficiently plot our data and
plotly to create i
# set your working directory #setwd("working-dir-path-here") # load packages library(ggplot2) # create efficient, professional plots library(plotly) # create cool interactive plots ## Warning: package 'plotly' was built under R version 3.4.1 ## ## Attaching package: 'plotly' ## The following object is masked from 'package:ggplot2': ## ## last_plot ## The following object is masked from 'package:stats': ## ## filter ## The following object is masked from 'package:graphics': ## ## layout
Import Precipitation Data
We will use the
in this analysis. This dataset is the daily precipitation date from the COOP
station 050843 in Boulder, CO for 1 January 2003 through 31 December 2013.
Since the data format is a .csv, we can use
read.csv to import the data. After
we import the data, we can take a look at the first few lines using
which defaults to the first 6 rows, of the
data.frame. Finally, we can explore
the R object structure.
# import precip data into R data.frame precip.boulder <- read.csv("disturb-events-co13/precip/805325-precip_daily_2003-2013.csv", stringsAsFactors = FALSE, header = TRUE) # view first 6 lines of the data head(precip.boulder) ## STATION STATION_NAME ELEVATION LATITUDE LONGITUDE DATE ## 1 COOP:050843 BOULDER 2 CO US 1650.5 40.03389 -105.2811 20030101 01:00 ## 2 COOP:050843 BOULDER 2 CO US 1650.5 40.03389 -105.2811 20030201 01:00 ## 3 COOP:050843 BOULDER 2 CO US 1650.5 40.03389 -105.2811 20030202 19:00 ## 4 COOP:050843 BOULDER 2 CO US 1650.5 40.03389 -105.2811 20030202 22:00 ## 5 COOP:050843 BOULDER 2 CO US 1650.5 40.03389 -105.2811 20030203 02:00 ## 6 COOP:050843 BOULDER 2 CO US 1650.5 40.03389 -105.2811 20030205 02:00 ## HPCP Measurement.Flag Quality.Flag ## 1 0.0 g ## 2 0.0 g ## 3 0.2 ## 4 0.1 ## 5 0.1 ## 6 0.1 # view structure of data str(precip.boulder) ## 'data.frame': 1840 obs. of 9 variables: ## $ STATION : chr "COOP:050843" "COOP:050843" "COOP:050843" "COOP:050843" ... ## $ STATION_NAME : chr "BOULDER 2 CO US" "BOULDER 2 CO US" "BOULDER 2 CO US" "BOULDER 2 CO US" ... ## $ ELEVATION : num 1650 1650 1650 1650 1650 ... ## $ LATITUDE : num 40 40 40 40 40 ... ## $ LONGITUDE : num -105 -105 -105 -105 -105 ... ## $ DATE : chr "20030101 01:00" "20030201 01:00" "20030202 19:00" "20030202 22:00" ... ## $ HPCP : num 0 0 0.2 0.1 0.1 ... ## $ Measurement.Flag: chr "g" "g" " " " " ... ## $ Quality.Flag : chr " " " " " " " " ...
About the Data
Viewing the structure of these data, we can see that different types of data are included in this file.
- STATION and STATION_NAME: Identification of the COOP station.
- ELEVATION, LATITUDE and LONGITUDE: The spatial location of the station.
- DATE: Gives the date in the format: YYYYMMDD HH:MM. Notice that DATE is
chr, meaning the data are interpreted as a character class and not as a date.
- HPCP: The total precipitation given in inches
(since we selected
Standardfor the units), recorded for the hour ending at the time specified by DATE. Importantly, the metadata (see below) notes that the value 999.99 indicates missing data. Also important, hours with no precipitation are not recorded.
- Measurement Flag: Indicates if there are any abnormalities with the measurement of the data. Definitions of each flag can be found in Table 2 of the documentation.
- Quality Flag: Indicates if there are any potential quality problems with the data. Definitions of each flag can be found in Table 3 of the documentation.
Additional information about the data, known as metadata, is available in the
PRECIP_HLY_documentation.pdf file that can be downloaded along with the data.
(Note, as of Sept. 2016, there is a mismatch in the data downloaded and the
documentation. The differences are in the units and missing data value:
inches/999.99 (standard) or millimeters/25399.75 (metric)).
Clean the Data
Before we can start plotting and working with the data we always need to check several important factors:
- data class: is R interpreting the data the way we expect it. The function
str()is an important tools for this.
- NoData Values: We need to know if our data contains a specific value that means "data are missing" and if this value has been assigned to NA in R.
As we've noted, the date field is in a character class. We can convert it to a date/time
class that will allow R to correctly interpret the data and allow us to easily
plot the data. We can convert it to a date/time class using
# convert to date/time and retain as a new field precip.boulder$DateTime <- as.POSIXct(precip.boulder$DATE, format="%Y%m%d %H:%M") # date in the format: YearMonthDay Hour:Minute # double check structure str(precip.boulder$DateTime) ## POSIXct[1:1840], format: "2003-01-01 01:00:00" "2003-02-01 01:00:00" ...
- For more information on date/time classes, see the NEON tutorial Dealing With Dates & Times in R - as.Date, POSIXct, POSIXlt.
We've also learned that missing values, also known as NoData
values, are labelled with the placeholder
999.99. Do we have any NoData values in our data?
# histogram - would allow us to see 999.99 NA values # or other "weird" values that might be NA if we didn't know the NA value hist(precip.boulder$HPCP)
Looking at the histogram, it looks like we have mostly low values (which makes sense) but a few values
up near 1000 -- likely 999.99. We can assign these entries to be
NA, the value that
R interprets as no data.
# assing NoData values to NA precip.boulder$HPCP[precip.boulder$HPCP==999.99] <- NA # check that NA values were added; # we can do this by finding the sum of how many NA values there are sum(is.na(precip.boulder)) ##  94
There are 94 NA values in our dataset. This is missing data.
- Do we need to worry about the missing data?
- Could they affect our analyses?
This depends on what questions we are asking. Here we are looking at general patterns in the data across 10 years. Since we have just over 3650 days in our entire dataset, missing 94 probably won't affect the general trends we are looking at.
Can you think of a research question where we would need to be concerned about the missing data?
Plot Precipitation Data
Now that we've cleaned up the data, we can view it. To do this we will plot using
ggplot() from the
#plot the data precPlot_hourly <- ggplot(data=precip.boulder, # the data frame aes(DateTime, HPCP)) + # the variables of interest geom_bar(stat="identity") + # create a bar graph xlab("Date") + ylab("Precipitation (Inches)") + # label the x & y axes ggtitle("Hourly Precipitation - Boulder Station\n 2003-2013") # add a title precPlot_hourly ## Warning: Removed 94 rows containing missing values (position_stack).
As we can see, plots of hourly date lead to very small numbers and is difficult to represent all information on a figure. Hint: If you can't see any bars on your plot you might need to zoom in on it.
Plots and comparison of daily precipitation would be easier to view.
Plot Daily Precipitation
There are several ways to aggregate the data.
If you only want to view the data plotted by date you need to create a column with only dates (no time) and then re-plot.
# convert DATE to a Date class # (this will strip the time, but that is saved in DateTime) precip.boulder$DATE <- as.Date(precip.boulder$DateTime, # convert to Date class format="%Y%m%d %H:%M") #DATE in the format: YearMonthDay Hour:Minute # double check conversion str(precip.boulder$DATE) ## Date[1:1840], format: "2003-01-01" "2003-02-01" "2003-02-03" "2003-02-03" "2003-02-03" ... precPlot_daily1 <- ggplot(data=precip.boulder, # the data frame aes(DATE, HPCP)) + # the variables of interest geom_bar(stat="identity") + # create a bar graph xlab("Date") + ylab("Precipitation (Inches)") + # label the x & y axes ggtitle("Daily Precipitation - Boulder Station\n 2003-2013") # add a title precPlot_daily1 ## Warning: Removed 94 rows containing missing values (position_stack).
R will automatically combine all data from the same day and plot it as one entry.
Daily Plots & Data
If you want to record the combined hourly data for each day, you need to create a new data frame to store the daily data. We can
aggregate() function to combine all the hourly data into daily data.
We will use the date class DATE field we created in the previous code for this.
# aggregate the Precipitation (PRECIP) data by DATE precip.boulder_daily <-aggregate(precip.boulder$HPCP, # data to aggregate by=list(precip.boulder$DATE), # variable to aggregate by FUN=sum, # take the sum (total) of the precip na.rm=TRUE) # if the are NA values ignore them # if this is FALSE any NA value will prevent a value be totalled # view the results head(precip.boulder_daily) ## Group.1 x ## 1 2003-01-01 0.0 ## 2 2003-02-01 0.0 ## 3 2003-02-03 0.4 ## 4 2003-02-05 0.2 ## 5 2003-02-06 0.1 ## 6 2003-02-07 0.1
So we now have daily data but the column names don't mean anything. We can
give them meaningful names by using the
names() function. Instead of naming the column of
precipitation values with the original
HPCP, let's call it
# rename the columns names(precip.boulder_daily)[names(precip.boulder_daily)=="Group.1"] <- "DATE" names(precip.boulder_daily)[names(precip.boulder_daily)=="x"] <- "PRECIP" # double check rename head(precip.boulder_daily) ## DATE PRECIP ## 1 2003-01-01 0.0 ## 2 2003-02-01 0.0 ## 3 2003-02-03 0.4 ## 4 2003-02-05 0.2 ## 5 2003-02-06 0.1 ## 6 2003-02-07 0.1
Now we can plot the daily data.
# plot daily data precPlot_daily <- ggplot(data=precip.boulder_daily, # the data frame aes(DATE, PRECIP)) + # the variables of interest geom_bar(stat="identity") + # create a bar graph xlab("Date") + ylab("Precipitation (Inches)") + # label the x & y axes ggtitle("Daily Precipitation - Boulder Station\n 2003-2013") # add a title precPlot_daily
Compare this plot to the plot we created using the first method. Are they the same?
R Tip: This manipulation, or aggregation, of data
can also be done with the package
plyr using the
Subset the Data
Instead of looking at the data for the full decade, let's now focus on just the 2 months surrounding the flood on 11-15 September. We'll focus on the window from 15 August to 15 October.
Just like aggregating, we can accomplish this by interacting with the larger plot through the graphical interface or by creating a subset of the data and protting it separately.
Subset Within Plot
To see only a subset of the larger plot, we can simply set limits for the
scale on the x-axis with
# First, define the limits -- 2 months around the floods limits <- as.Date(c("2013-08-15", "2013-10-15")) # Second, plot the data - Flood Time Period precPlot_flood <- ggplot(data=precip.boulder_daily, aes(DATE, PRECIP)) + geom_bar(stat="identity") + scale_x_date(limits=limits) + xlab("Date") + ylab("Precipitation (Inches)") + ggtitle("Precipitation - Boulder Station\n August 15 - October 15, 2013") precPlot_flood ## Warning: Removed 770 rows containing missing values (position_stack).
Now we can easily see the dramatic rainfall event in mid-September!
R Tip: If you are using a date-time class, instead
of just a date class, you need to use
Subset The Data
Now let's create a subset of the data and plot it.
# subset 2 months around flood precip.boulder_AugOct <- subset(precip.boulder_daily, DATE >= as.Date('2013-08-15') & DATE <= as.Date('2013-10-15')) # check the first & last dates min(precip.boulder_AugOct$DATE) ##  "2013-08-21" max(precip.boulder_AugOct$DATE) ##  "2013-10-11" # create new plot precPlot_flood2 <- ggplot(data=precip.boulder_AugOct, aes(DATE,PRECIP)) + geom_bar(stat="identity") + xlab("Time") + ylab("Precipitation (inches)") + ggtitle("Daily Total Precipitation \n Boulder Creek 2013") precPlot_flood2
Interactive Plots - Plotly
Let's turn our plot into an interactive Plotly plot.
# setup your plot.ly credentials; if not already set up #Sys.setenv("plotly_username"="your.user.name.here") #Sys.setenv("plotly_api_key"="your.key.here") # view plotly plot in R ggplotly(precPlot_flood2)
# publish plotly plot to your plot.ly online account when you are happy with it api_create(precPlot_flood2)
Challenge: Plot Precip for Boulder Station Since 1948
The Boulder precipitation station has been recording data since 1948. Use the steps above to create a plot of the full record of precipitation at this station (1948 - 2013). The full dataset takes considerable time to download, so we recommend you use the dataset provided in the compressed file ("805333-precip_daily_1948-2013.csv").
As an added challenge, aggregate the data by month instead of by day.
Units & Scale
If you are using a dataset downloaded before 2016, the units were in
hundredths of an inch, this isn't the most useful measure. You might want to
create a new column
PRECIP that contains the data from
HPCP converted to
# convert from 100th inch by dividing by 100 precip.boulder$PRECIP<-precip.boulder$HPCP/100 # view & check to make sure conversion occured head(precip.boulder) ## STATION STATION_NAME ELEVATION LATITUDE LONGITUDE DATE HPCP ## 1 COOP:050843 BOULDER 2 CO US 1650.5 40.03389 -105.2811 2003-01-01 0.0 ## 2 COOP:050843 BOULDER 2 CO US 1650.5 40.03389 -105.2811 2003-02-01 0.0 ## 3 COOP:050843 BOULDER 2 CO US 1650.5 40.03389 -105.2811 2003-02-03 0.2 ## 4 COOP:050843 BOULDER 2 CO US 1650.5 40.03389 -105.2811 2003-02-03 0.1 ## 5 COOP:050843 BOULDER 2 CO US 1650.5 40.03389 -105.2811 2003-02-03 0.1 ## 6 COOP:050843 BOULDER 2 CO US 1650.5 40.03389 -105.2811 2003-02-05 0.1 ## Measurement.Flag Quality.Flag DateTime PRECIP ## 1 g 2003-01-01 01:00:00 0.000 ## 2 g 2003-02-01 01:00:00 0.000 ## 3 2003-02-02 19:00:00 0.002 ## 4 2003-02-02 22:00:00 0.001 ## 5 2003-02-03 02:00:00 0.001 ## 6 2003-02-05 02:00:00 0.001
PRECIP. Did we do the conversion correctly?