Intro to Working with Hyperspectral Remote Sensing Data in HDF5 Format in R
In this tutorial, we will explore reading and extracting spatial raster data stored within a HDF5 file using R.
After completing this tutorial, you will be able to:
- Explain how HDF5 data can be used to store spatial data and the associated benefits of this format when working with large spatial data cubes.
- Extract metadata from HDF5 files.
- Slice or subset HDF5 data. You will extract one band of pixels.
- Plot a matrix as an image and a raster.
- Export a final GeoTIFF (spatially projected) that can be used both in further analysis and in common GIS tools like QGIS.
Things You’ll Need To Complete This Tutorial
To complete this tutorial you will need the most current version of R and, preferably, RStudio loaded on your computer.
R Libraries to Install:
Data to Download
These hyperspectral remote sensing data provide information on the National Ecological Observatory Network's San Joaquin Exerimental Range field site. The data were collected over the San Joaquin field site located in California (Domain 17) and processed at NEON headquarters. The entire dataset can be accessed by request from the NEON Data Portal.
Set Working Directory: This lesson assumes that you have set your working directory to the location of the downloaded and unzipped data subsets.
R Script & Challenge Code: NEON data lessons often contain challenges that reinforce learned skills. If available, the code for challenge solutions is found in the downloadable R script of the entire lesson, available in the footer of each lesson page.
About Hyperspectral Remote Sensing Data
The electromagnetic spectrum is composed of thousands of bands representing different types of light energy. Imaging spectrometers (instruments that collect hyperspectral data) break the electromagnetic spectrum into groups of bands that support classification of objects by their spectral properties on the earth's surface. Hyperspectral data consists of many bands - up to hundreds of bands - that cover the electromagnetic spectrum.
The NEON imaging spectrometer (NIS) collects data within the 380 nm to 2510 nm portions of the electromagnetic spectrum within bands that are approximately 5 nm in width. This results in a hyperspectral data cube that contains approximately 428 bands - which means BIG DATA.
The HDF5 data model natively compresses data stored within it (makes it smaller) and supports data slicing (extracting only the portions of the data that you need to work with rather than reading the entire dataset into memory). These features in addition to the ability to support spatial data and associated metadata make it ideal for working with large data cubes such as those generated by imaging spectrometers.
In this tutorial we will explore reading and extracting spatial raster data stored within a HDF5 file using R.
Read HDF5 data into R
We will use the
rhdf5 packages to read in the HDF5 file that
contains hyperspectral data for the
NEON San Joaquin (SJER) field site.
Let's start by calling the needed packages and reading in our NEON HDF5 file.
Please be sure that you have at least version 2.10 of
rhdf5 installed. Use:
packageVersion("rhdf5") to check the package version.
# Load `raster` and `rhdf5` packages and read NIS data into R library(raster) library(rhdf5) library(rgdal) # set working directory to ensure R can find the file we wish to import and where # we want to save our files #setwd("working-dir-path-here")
Data Tip: To update all packages installed in
# Define the file name to be opened f <- 'NEON-DS-Imaging-Spectrometer-Data.h5' # look at the HDF5 file structure h5ls(f,all=T) ## group name ltype corder_valid corder cset otype ## 0 / Reflectance H5L_TYPE_HARD FALSE 0 0 H5I_DATASET ## 1 / fwhm H5L_TYPE_HARD FALSE 0 0 H5I_DATASET ## 2 / map info H5L_TYPE_HARD FALSE 0 0 H5I_DATASET ## 3 / spatialInfo H5L_TYPE_HARD FALSE 0 0 H5I_GROUP ## 4 / wavelength H5L_TYPE_HARD FALSE 0 0 H5I_DATASET ## num_attrs dclass dtype stype rank dim ## 0 6 INTEGER H5T_STD_I16LE SIMPLE 3 477 x 502 x 426 ## 1 2 FLOAT H5T_IEEE_F32LE SIMPLE 2 426 x 1 ## 2 1 STRING HST_STRING SIMPLE 1 1 ## 3 11 0 ## 4 2 FLOAT H5T_IEEE_F32LE SIMPLE 2 426 x 1 ## maxdim ## 0 477 x 502 x 426 ## 1 426 x 1 ## 2 1 ## 3 ## 4 426 x 1
When you look at the structure of the data, take note of the
map info dataset,
spatialInfo group, and the
Reflectance datasets. The
spatialInfo folder contains the spatial attributes of the data including its
Coordinate Reference System (CRS). The CRS documents how the data are physically
location on the earth. The
wavelength dataset contains the middle wavelength
values for each band in the data. The reflectance dataset contains the image data
that we will use for both data processing and visualization.
More information on raster metadata:
Data Tip - HDF5 Structure: Note that the structure of individual HDF5 files may vary depending on who produced the data. In this case, the Wavelength and reflectance data within the file are both datasets. However the spatial information is contained within a group. Data downloaded from another organization like NASA, may look different. This is why it's important to explore the data before diving into using it!
We can use the
h5readAttributes function to read and extract metadata from the
HDF5 file. Let's start by reading in the spatial information.
# get spatialInfo using the h5readAttributes function spInfo <- h5readAttributes(f,"spatialInfo") # get attributes for the Reflectance dataset reflInfo <- h5readAttributes(f,"Reflectance")
Next, let's read in the wavelength center associated with each band in the HDF5 file.
# read in the wavelength information from the HDF5 file wavelengths<- h5read(f,"wavelength")
Which wavelength is band 19 associated with? (Hint: look at the wavelengths
vector that we just imported and check out the data located at index 19 -
Band 19 has a associate wavelength center or 0.47244 which is in micrometers. This value equates to 472.44 nanometers (nm) which is in the visible blue portion of the electromagnetic spectrum (~ 400-700 nm).
Bands and Wavelengths
A band represents a group of wavelengths. For example, the wavelength values between 800nm and 805nm might be one band as captured by an imaging spectrometer. The imaging spectrometer collects reflected light energy in a pixel for light in that band. Often when you work with a multi or hyperspectral dataset, the band information is reported as the center wavelength value. This value represents the center point value of the wavelengths represented in that band. Thus in a band spanning 800-805 nm, the center would be 802.5 nm). The full width half max (FWHM) will also be reported. This value represents the spread of the band around that center point. So, a band that covers 800 nm-805 nm might have a FWHM of 2.5 and a wavelength value of 802.5.
The HDF5 dataset that we are working with in this activity contains more information than we need to work with. For example, we don't necessarily need to process all 426 bands - if we are interested in creating a product like NDVI which only users bands in the near infra-red and red portions of the spectrum. Or we might only be interested in a spatial subset of the data - perhaps a region where we have plots in the field.
The HDF5 format allows us to slice (or subset) the data - quickly extracting the
subset that we need to process. Let's extract one of the green bands in our
dataset - band 34. By the way - what is the center wavelength value associated
with band 34? hint
wavelengths. How do we know this band is a green band
in the visible portion of the spectrum?
# note that we can grab the dimensions of the dataset from the attributes # we can then use that information to slice out our band data nRows <- reflInfo$row_col_band nCols <- reflInfo$row_col_band nBands <- reflInfo$row_col_band nRows ##  502 nCols ##  477 nBands ##  426
The HDF5 read function reads data in the order: Cols, Rows and bands. This is different from how R reads data (rows, columns, bands). We'll adjust for this later.
# Extract or "slice" data for band 34 from the HDF5 file b34<- h5read(f,"Reflectance",index=list(1:nCols,1:nRows,34)) # what type of object is b34? class(b34) ##  "array"
A Note About Data Slicing in HDF5
Data slicing allows us to extract and work with subsets of the data rather than reading in the entire dataset into memory. Thus, in this case, we can extract and plot the green band without reading in all 426 bands of information. The ability to slice large datasets, makes HDF5 ideal for working with big data.
Next, let's convert our data from an array (more than 2 dimensions) to a matrix (just 2 dimensions). We need to have our data in a matrix format to plot it.
# convert from array to matrix b34 <- b34[,,1] # check it class(b34) ##  "matrix"
Arrays vs. Matrices
Arrays are matrices with more than 2 dimensions. When we say dimension, we are talking about the "z" associated with the data (imagine a series of tabs in a spreadsheet). Put the other way: matrices are arrays with only 2 dimensions. Arrays can have any number of dimensions one, two, ten or more.
Here is a matrix that is 4 x 3 in size (4 rows and 3 columns):
|Metric||species 1||species 2|
Dimensions in Arrays
An array contains 1 or more dimensions in the "z" direction. For example, let's say that we collected this same set of species data for every day in a 30 day month. We might then have a matrix like the one above for each day for a total of 30 days making a 4 x 3 x 30 array (this dataset has more than 2 dimensions). More on R object types here.
Next, let's look at the metadata for the reflectance data. When we do this, take note of 1) the scale factor and 2) the data ignore value. Then we can plot the band 34 data. Plotting spatial data as a visual "data check" is a good idea to make sure processing is being performed correctly and all is well with the image.
# look at the metadata for the reflectance dataset h5readAttributes(f,"Reflectance") ## $DIMENSION_LABELS ##  "Wavelength" "Line" "Sample" ## ## $Description ##  "Atmospherically corrected reflectance." ## ## $`Scale Factor` ##  10000 ## ## $Unit ##  "unitless. Valid range 0-1." ## ## $`data ignore value` ##  "15000" ## ## $row_col_band ##  502 477 426 # plot the image image(b34)
# oh, that doens't tell us much # what happens if we plot a log of the data? image(log(b34))
What do you notice about the first image? It's a bit dark and lacking any detail. What could be causing this? It got better when plotting the log of the values, but still not great.
Let's look at the distribution of reflectance values in our data to figure out what is going on.
# Plot range of reflectance values as a histogram to view range # and distribution of values. hist(b34,breaks=40,col="darkmagenta")
# View values between 0 and 5000 hist(b34,breaks=40,col="darkmagenta",xlim = c(0, 5000))
# View higher values hist(b34, breaks=40,col="darkmagenta",xlim = c(5000, 15000),ylim=c(0,100))
As you're examining the histograms above, keep in mind that reflectance values range between 0-1. The data scale factor in the metadata tells us to divide all reflectance values by 10,000. Thus, a value of 5,000 equates to a reflectance value of 0.50. Storing data as integers (without decimal places) compared to floating points (with decimal places) creates a smaller file. You will see this done often when working with remote sensing data.
Notice in the data that there are some larger reflectance values (>5,000) that represent a smaller number of pixels. These pixels are skewing how the image renders.
Data Ignore Value
Image data in raster format will often contain a data ignore value and a scale factor. The data ignore value represents pixels where there are no data. Among other causes, no data values may be attributed to the sensor not collecting data in that area of the image or to processing results which yield null values.
Remember that the metadata for the
Reflectance dataset designated 15,000 as
data ignore value. Thus, let's set all pixels with a value == 15,000 to
(no value). If we do this, R won't try to render these pixels.
# there is a no data value in our raster - let's define it myNoDataValue <- as.numeric(reflInfo$`data ignore value`) myNoDataValue ##  15000 # set all values greater than 15,000 to NA b34[b34 == myNoDataValue] <- NA # plot the image now image(b34)
Reflectance Values and Image Stretch
Our image still looks dark because R is trying to render all reflectance values between 0 and 14999 as if they were distributed equally in the histogram. However we know they are not distributed equally. There are many more values between 0-5000 then there are values >5000.
Images have a distribution of reflectance values. A typical image viewing program
will render the values by distributing the entire range of reflectance values
across a range of "shades" that the monitor can render - between 0 and 255. However, often the distribution of reflectance values is not linear. For example, in the case of our data, most of the reflectance values fall between 0 and 0.5. Yet there are a few values >1 that are heavily impacting the way the image is drawn on our monitor. Imaging processing programs like ENVI, QGIS and ArcGIS (and even Adobe Photoshop) allow you to adjust the stretch of the image. This is similar to adjusting the contrast and brightness in Photoshop.
The proper way to adjust our data would be
what's called an
image stretch. We will learn how to stretch our image data,
later. For now, let's plot the values as the log function on the pixel
reflectance values to factor out those larger values.
The log applied to our image increases the contrast making it look more like an image. However, look at the images below. The top one is what our log adjusted image looks like when plotted. The bottom on is an RGB version of the same image. Notice a difference?
Notice that there are three data dimensions for this file: Bands x Rows x
Columns. However, when R reads in the dataset, it reads them as: Columns x
Bands x Rows. The data are flipped. We can quickly transpose the data to correct
for this using the
transpose command in R.
The orientation is rotated in our log adjusted image. This is because R reads in matrices starting from the upper left hand corner. Whereas, most rasters read pixels starting from the lower left hand corner. In the next section, we will deal with this issue by creating a proper georeferenced (spatially located) raster in R. The raster format will read in pixels following the same methods as other GIS and imaging processing software like QGIS and ENVI do.
# We need to transpose x and y values in order for our # final image to plot properly b34<-t(b34) image(log(b34), main="Transposed Image")
Create a Georeferenced Raster
Next, we will create a proper raster using the
b34 matrix. The raster
format will allow us to define and manage:
- Image stretch
- Coordinate reference system & spatial reference
It will also account for the orientation issue discussed above.
To create a raster in R, we need a few pieces of information, including:
- The coordinate reference system (CRS)
- The location of the first pixel (located in the lower left hand corner of the raster).
- The resolution or size of each pixel in the data.
First, let's grab the spatial information that we need from the HDF5 file.
The CRS and associated information that is needed is stored in the
The map info string looks something like this:
Notice that this information is separated by commas. We can use the
strsplit command in R to extract each element into a vector. The elements are
position 4 and 5 represent the lower left hand corner of the raster. We need
this information to define the raster's extent.
# Populate the raster image extent value. # get the map info, split out elements mapInfo<-h5read(f,"map info") # Extract each element of the map info information # so we can extract the lower left hand corner coordinates. mapInfo<-unlist(strsplit(mapInfo, ",")) # view the attributes in the map dataset mapInfo ##  "UTM" "1.000" "1.000" "256521.000" ##  "4112571.000" "1.000000e+000" "1.000000e+000" "11" ##  "North" "WGS-84" "units=Meters"
Next we define the extents of our raster. The extents will be used to calculate
the raster's resolution. The lower left hand corner is located at mapInfo[4:5].
We can define the final raster dataset extent by adding the number of rows to
the Y lower left hand corner coordinate and the number of columns in the
Reflectance dataset to the X lower left hand corner coordinate.
Define Raster CRS
We have defined the extent of our raster but we still need to define the Coordinate
reference system (
CRS) of the raster. To do that, we can first grab the CRS
string from the HDF5 attributes. Then we can assign that CRS to the raster object.
# Create the projection in as object myCRS <- spInfo$projdef myCRS ##  "+proj=utm +zone=11N +ellps=WGS84 +datum=WGS84" # define final raster with projection info # note that capitalization will throw errors on a MAC. # if UTM is all caps it might cause an error! b34r <- raster(b34, crs=myCRS) b34r ## class : RasterLayer ## dimensions : 502, 477, 239454 (nrow, ncol, ncell) ## resolution : 0.002096436, 0.001992032 (x, y) ## extent : 0, 1, 0, 1 (xmin, xmax, ymin, ymax) ## coord. ref. : +proj=utm +zone=11N +ellps=WGS84 +datum=WGS84 +towgs84=0,0,0 ## data source : in memory ## names : layer ## values : 116, 15677 (min, max) #let's have a look at our properly positioned raster. Take note of the #coordinates on the x and y axis. image(log(b34r), xlab = "UTM Easting", ylab = "UTM Northing", main = "Properly Positioned Raster")
# grab resolution of raster as an object res <- spInfo$xscale res ##  1 # Grab the UTM coordinates of the upper left hand corner of the raster #grab the left side x coordinate (xMin) xMin <- as.numeric(mapInfo) #grab the top corner coordinate (yMax) yMax <- as.numeric(mapInfo) xMin ##  256521 yMax ##  4112571 # Calculate the lower right hand corner to define the full extent of the # raster. To do this we need the number of columns and rows in the raster # and the resolution of the raster. # note that you need to multiple the columns and rows by the resolution of # the data to calculate the proper extent! xMax <- (xMin + (ncol(b34))*res) yMin <- (yMax - (nrow(b34))*res) xMax ##  256998 yMin ##  4112069 # define the extent (left, right, top, bottom) rasExt <- extent(xMin,xMax,yMin,yMax) rasExt ## class : Extent ## xmin : 256521 ## xmax : 256998 ## ymin : 4112069 ## ymax : 4112571 # assign the spatial extent to the raster extent(b34r) <- rasExt # look at raster attributes b34r ## class : RasterLayer ## dimensions : 502, 477, 239454 (nrow, ncol, ncell) ## resolution : 1, 1 (x, y) ## extent : 256521, 256998, 4112069, 4112571 (xmin, xmax, ymin, ymax) ## coord. ref. : +proj=utm +zone=11N +ellps=WGS84 +datum=WGS84 +towgs84=0,0,0 ## data source : in memory ## names : layer ## values : 116, 15677 (min, max)
We can adjust the colors of our raster too if we want.
#let's change the colors of our raster and adjust the zlims col=terrain.colors(25) image(b34r, xlab = "UTM Easting", ylab = "UTM Northing", main= "Raster w Custom Colors", col=col, zlim=c(0,3000))
We've now created a raster from band 34 reflectance data. We can export the data
as a raster, using the
#write out the raster as a geotiff writeRaster(b34r, file="band34.tif", format="GTiff", overwrite=TRUE) #It's always good practice to close the H5 connection before moving on! #close the H5 file H5close()
Challenge: Work with Rasters
Try these three extensions on your own:
Create rasters using other bands in the dataset.
Vary the distribution of values in the image to mimic an image stretch. e.g.
b34[b34 > 6000 ] <- 10000
Use what you know to extract ALL of the reflectance values for ONE pixel rather than for an entire band. HINT: this will require you to pick an x and y value and then all values in the z dimension:
aPixel<- h5read(f,"Reflectance",index=list(54,36,NULL)). Plot the spectra output.