Intro to Working with Hyperspectral Remote Sensing Data in HDF5 Format in R

Leah A. Wasser, Edmund Hart
Table of Contents

In this tutorial, we will explore reading and extracting spatial raster data stored within a HDF5 file using R.

Learning Objectives

After completing this tutorial, you will be able to:

  • Explain how HDF5 data can be used to store spatial data and the associated benefits of this format when working with large spatial data cubes.
  • Extract metadata from HDF5 files.
  • Slice or subset HDF5 data. You will extract one band of pixels.
  • Plot a matrix as an image and a raster.
  • Export a final GeoTIFF (spatially projected) that can be used both in further analysis and in common GIS tools like QGIS.

Things You’ll Need To Complete This Tutorial

To complete this tutorial you will need the most current version of R and, preferably, RStudio loaded on your computer.

R Libraries to Install:

  • rhdf5: source(""), biocLite("rhdf5")
  • raster: install.packages('raster')
  • rgdal: install.packages('rgdal')

More on Packages in R - Adapted from Software Carpentry.

Data to Download

Download NEON Teaching Data Subset: Imaging Spectrometer Data - HDF5

These hyperspectral remote sensing data provide information on the National Ecological Observatory Network's San Joaquin Exerimental Range field site. The data were collected over the San Joaquin field site located in California (Domain 17) and processed at NEON headquarters. The entire dataset can be accessed by request from the NEON Data Portal.

Download Dataset

Set Working Directory: This lesson assumes that you have set your working directory to the location of the downloaded and unzipped data subsets.

An overview of setting the working directory in R can be found here.

R Script & Challenge Code: NEON data lessons often contain challenges that reinforce learned skills. If available, the code for challenge solutions is found in the downloadable R script of the entire lesson, available in the footer of each lesson page.

About Hyperspectral Remote Sensing Data

The electromagnetic spectrum is composed of thousands of bands representing different types of light energy. Imaging spectrometers (instruments that collect hyperspectral data) break the electromagnetic spectrum into groups of bands that support classification of objects by their spectral properties on the earth's surface. Hyperspectral data consists of many bands - up to hundreds of bands - that cover the electromagnetic spectrum.

The NEON imaging spectrometer (NIS) collects data within the 380 nm to 2510 nm portions of the electromagnetic spectrum within bands that are approximately 5 nm in width. This results in a hyperspectral data cube that contains approximately 428 bands - which means BIG DATA.

A data cube of NEON hyperspectral data. Each layer in the cube represents a band.

The HDF5 data model natively compresses data stored within it (makes it smaller) and supports data slicing (extracting only the portions of the data that you need to work with rather than reading the entire dataset into memory). These features in addition to the ability to support spatial data and associated metadata make it ideal for working with large data cubes such as those generated by imaging spectrometers.

In this tutorial we will explore reading and extracting spatial raster data stored within a HDF5 file using R.

Read HDF5 data into R

We will use the raster and rhdf5 packages to read in the HDF5 file that contains hyperspectral data for the NEON San Joaquin (SJER) field site. Let's start by calling the needed packages and reading in our NEON HDF5 file.

Please be sure that you have at least version 2.10 of rhdf5 installed. Use: packageVersion("rhdf5") to check the package version.

# Load `raster` and `rhdf5` packages and read NIS data into R

# set working directory to ensure R can find the file we wish to import and where
# we want to save our files

Data Tip: To update all packages installed in R, use update.packages().

# Define the file name to be opened
f <- 'NEON-DS-Imaging-Spectrometer-Data.h5'
# look at the HDF5 file structure 

##   group        name         ltype corder_valid corder cset       otype
## 0     / Reflectance H5L_TYPE_HARD        FALSE      0    0 H5I_DATASET
## 1     /        fwhm H5L_TYPE_HARD        FALSE      0    0 H5I_DATASET
## 2     /    map info H5L_TYPE_HARD        FALSE      0    0 H5I_DATASET
## 3     / spatialInfo H5L_TYPE_HARD        FALSE      0    0   H5I_GROUP
## 4     /  wavelength H5L_TYPE_HARD        FALSE      0    0 H5I_DATASET
##   num_attrs  dclass          dtype  stype rank             dim
## 0         6 INTEGER  H5T_STD_I16LE SIMPLE    3 477 x 502 x 426
## 1         2   FLOAT H5T_IEEE_F32LE SIMPLE    2         426 x 1
## 2         1  STRING     HST_STRING SIMPLE    1               1
## 3        11                                  0                
## 4         2   FLOAT H5T_IEEE_F32LE SIMPLE    2         426 x 1
##            maxdim
## 0 477 x 502 x 426
## 1         426 x 1
## 2               1
## 3                
## 4         426 x 1

When you look at the structure of the data, take note of the map info dataset, the spatialInfo group, and the wavelength and Reflectance datasets. The spatialInfo folder contains the spatial attributes of the data including its Coordinate Reference System (CRS). The CRS documents how the data are physically location on the earth. The wavelength dataset contains the middle wavelength values for each band in the data. The reflectance dataset contains the image data that we will use for both data processing and visualization.

More information on raster metadata:

Data Tip - HDF5 Structure: Note that the structure of individual HDF5 files may vary depending on who produced the data. In this case, the Wavelength and reflectance data within the file are both datasets. However the spatial information is contained within a group. Data downloaded from another organization like NASA, may look different. This is why it's important to explore the data before diving into using it!

We can use the h5readAttributes function to read and extract metadata from the HDF5 file. Let's start by reading in the spatial information.

# get spatialInfo using the h5readAttributes function 
spInfo <- h5readAttributes(f,"spatialInfo")

# get attributes for the Reflectance dataset
reflInfo <- h5readAttributes(f,"Reflectance")

Next, let's read in the wavelength center associated with each band in the HDF5 file.

# read in the wavelength information from the HDF5 file
wavelengths<- h5read(f,"wavelength")

Which wavelength is band 19 associated with? (Hint: look at the wavelengths vector that we just imported and check out the data located at index 19 - wavelengths[19]).

472 nanometers falls within the blue portion of the electromagnetic spectrum. Source: National Ecological Observatory Network

Band 19 has a associate wavelength center or 0.47244 which is in micrometers. This value equates to 472.44 nanometers (nm) which is in the visible blue portion of the electromagnetic spectrum (~ 400-700 nm).

Bands and Wavelengths

A band represents a group of wavelengths. For example, the wavelength values between 800nm and 805nm might be one band as captured by an imaging spectrometer. The imaging spectrometer collects reflected light energy in a pixel for light in that band. Often when you work with a multi or hyperspectral dataset, the band information is reported as the center wavelength value. This value represents the center point value of the wavelengths represented in that band. Thus in a band spanning 800-805 nm, the center would be 802.5 nm). The full width half max (FWHM) will also be reported. This value represents the spread of the band around that center point. So, a band that covers 800 nm-805 nm might have a FWHM of 2.5 and a wavelength value of 802.5.

Bands represent a range of values (types of light) within the electromagnetic spectrum. Values for each band are often represented as the center point value of each band.

The HDF5 dataset that we are working with in this activity contains more information than we need to work with. For example, we don't necessarily need to process all 426 bands - if we are interested in creating a product like NDVI which only users bands in the near infra-red and red portions of the spectrum. Or we might only be interested in a spatial subset of the data - perhaps a region where we have plots in the field.

The HDF5 format allows us to slice (or subset) the data - quickly extracting the subset that we need to process. Let's extract one of the green bands in our dataset - band 34. By the way - what is the center wavelength value associated with band 34? hint wavelengths[34]. How do we know this band is a green band in the visible portion of the spectrum?

# note that we can grab the dimensions of the dataset from the attributes
# we can then use that information to slice out our band data
nRows <- reflInfo$row_col_band[1]
nCols <- reflInfo$row_col_band[2]
nBands <- reflInfo$row_col_band[3]


## [1] 502


## [1] 477


## [1] 426

The HDF5 read function reads data in the order: Cols, Rows and bands. This is different from how R reads data (rows, columns, bands). We'll adjust for this later.

# Extract or "slice" data for band 34 from the HDF5 file
b34<- h5read(f,"Reflectance",index=list(1:nCols,1:nRows,34))

# what type of object is b34?

## [1] "array"

A Note About Data Slicing in HDF5

Data slicing allows us to extract and work with subsets of the data rather than reading in the entire dataset into memory. Thus, in this case, we can extract and plot the green band without reading in all 426 bands of information. The ability to slice large datasets, makes HDF5 ideal for working with big data.

Next, let's convert our data from an array (more than 2 dimensions) to a matrix (just 2 dimensions). We need to have our data in a matrix format to plot it.

# convert from array to matrix
b34 <- b34[,,1]

# check it

## [1] "matrix"

Arrays vs. Matrices

Arrays are matrices with more than 2 dimensions. When we say dimension, we are talking about the "z" associated with the data (imagine a series of tabs in a spreadsheet). Put the other way: matrices are arrays with only 2 dimensions. Arrays can have any number of dimensions one, two, ten or more.

Here is a matrix that is 4 x 3 in size (4 rows and 3 columns):

Metric species 1 species 2
total number 23 45
average weight 14 5
average length 2.4 3.5
average height 32 12

Dimensions in Arrays

An array contains 1 or more dimensions in the "z" direction. For example, let's say that we collected this same set of species data for every day in a 30 day month. We might then have a matrix like the one above for each day for a total of 30 days making a 4 x 3 x 30 array (this dataset has more than 2 dimensions). More on R object types here.

Right: a matrix has only 2 dimensions. Left: an array has more than 2 dimensions.

Next, let's look at the metadata for the reflectance data. When we do this, take note of 1) the scale factor and 2) the data ignore value. Then we can plot the band 34 data. Plotting spatial data as a visual "data check" is a good idea to make sure processing is being performed correctly and all is well with the image.

# look at the metadata for the reflectance dataset

## [1] "Wavelength" "Line"       "Sample"    
## $Description
## [1] "Atmospherically corrected reflectance."
## $`Scale Factor`
## [1] 10000
## $Unit
## [1] "unitless. Valid range 0-1."
## $`data ignore value`
## [1] "15000"
## $row_col_band
## [1] 502 477 426

# plot the image

# oh, that doens't tell us much
# what happens if we plot a log of the data?

What do you notice about the first image? It's a bit dark and lacking any detail. What could be causing this? It got better when plotting the log of the values, but still not great.

Let's look at the distribution of reflectance values in our data to figure out what is going on.

# Plot range of reflectance values as a histogram to view range
# and distribution of values.

# View values between 0 and 5000
hist(b34,breaks=40,col="darkmagenta",xlim = c(0, 5000))

# View higher values
hist(b34, breaks=40,col="darkmagenta",xlim = c(5000, 15000),ylim=c(0,100))

As you're examining the histograms above, keep in mind that reflectance values range between 0-1. The data scale factor in the metadata tells us to divide all reflectance values by 10,000. Thus, a value of 5,000 equates to a reflectance value of 0.50. Storing data as integers (without decimal places) compared to floating points (with decimal places) creates a smaller file. You will see this done often when working with remote sensing data.

Notice in the data that there are some larger reflectance values (>5,000) that represent a smaller number of pixels. These pixels are skewing how the image renders.

Data Ignore Value

Image data in raster format will often contain a data ignore value and a scale factor. The data ignore value represents pixels where there are no data. Among other causes, no data values may be attributed to the sensor not collecting data in that area of the image or to processing results which yield null values.

Remember that the metadata for the Reflectance dataset designated 15,000 as data ignore value. Thus, let's set all pixels with a value == 15,000 to NA (no value). If we do this, R won't try to render these pixels.

# there is a no data value in our raster - let's define it
myNoDataValue <- as.numeric(reflInfo$`data ignore value`)

## [1] 15000

# set all values greater than 15,000 to NA
b34[b34 == myNoDataValue] <- NA

# plot the image now

Reflectance Values and Image Stretch

Our image still looks dark because R is trying to render all reflectance values between 0 and 14999 as if they were distributed equally in the histogram. However we know they are not distributed equally. There are many more values between 0-5000 then there are values >5000.

Images have a distribution of reflectance values. A typical image viewing program will render the values by distributing the entire range of reflectance values
across a range of "shades" that the monitor can render - between 0 and 255. However, often the distribution of reflectance values is not linear. For example, in the case of our data, most of the reflectance values fall between 0 and 0.5. Yet there are a few values >1 that are heavily impacting the way the image is drawn on our monitor. Imaging processing programs like ENVI, QGIS and ArcGIS (and even Adobe Photoshop) allow you to adjust the stretch of the image. This is similar to adjusting the contrast and brightness in Photoshop.

The proper way to adjust our data would be what's called an image stretch. We will learn how to stretch our image data, later. For now, let's plot the values as the log function on the pixel reflectance values to factor out those larger values.


The log applied to our image increases the contrast making it look more like an image. However, look at the images below. The top one is what our log adjusted image looks like when plotted. The bottom on is an RGB version of the same image. Notice a difference?

LEFT: The image as it should look. RIGHT: the image that we outputted from the code above. Notice a difference?

Transpose Image

Notice that there are three data dimensions for this file: Bands x Rows x Columns. However, when R reads in the dataset, it reads them as: Columns x Bands x Rows. The data are flipped. We can quickly transpose the data to correct for this using the t or transpose command in R.

The orientation is rotated in our log adjusted image. This is because R reads in matrices starting from the upper left hand corner. Whereas, most rasters read pixels starting from the lower left hand corner. In the next section, we will deal with this issue by creating a proper georeferenced (spatially located) raster in R. The raster format will read in pixels following the same methods as other GIS and imaging processing software like QGIS and ENVI do.

# We need to transpose x and y values in order for our 
# final image to plot properly
image(log(b34), main="Transposed Image")

Create a Georeferenced Raster

Next, we will create a proper raster using the b34 matrix. The raster format will allow us to define and manage:

  • Image stretch
  • Coordinate reference system & spatial reference
  • Resolution

It will also account for the orientation issue discussed above.

To create a raster in R, we need a few pieces of information, including:

  • The coordinate reference system (CRS)
  • The location of the first pixel (located in the lower left hand corner of the raster).
  • The resolution or size of each pixel in the data.

First, let's grab the spatial information that we need from the HDF5 file. The CRS and associated information that is needed is stored in the map info dataset.

The map info string looks something like this:

"UTM,1.000,1.000,256521.000,4112571.000,1.000000e+000, 1.000000e+000,11,North,WGS-84,units=Meters".

Notice that this information is separated by commas. We can use the strsplit command in R to extract each element into a vector. The elements are position 4 and 5 represent the lower left hand corner of the raster. We need this information to define the raster's extent.

# Populate the raster image extent value. 
# get the map info, split out elements
mapInfo<-h5read(f,"map info")

# Extract each element of the map info information 
# so we can extract the lower left hand corner coordinates.
mapInfo<-unlist(strsplit(mapInfo, ","))

# view the attributes in the map dataset

##  [1] "UTM"           "1.000"         "1.000"         "256521.000"   
##  [5] "4112571.000"   "1.000000e+000" "1.000000e+000" "11"           
##  [9] "North"         "WGS-84"        "units=Meters"

Next we define the extents of our raster. The extents will be used to calculate the raster's resolution. The lower left hand corner is located at mapInfo[4:5]. We can define the final raster dataset extent by adding the number of rows to the Y lower left hand corner coordinate and the number of columns in the Reflectance dataset to the X lower left hand corner coordinate.

The extent of a raster represents the spatial location of each corner. The coordinate units will be determined by the spatial projection/ coordinate reference system that the data are in. Source: National Ecological Observatory Network (NEON)

Learn more about raster attributes including extent, and coordinate reference systems here.

Define Raster CRS

We have defined the extent of our raster but we still need to define the Coordinate reference system (CRS) of the raster. To do that, we can first grab the CRS string from the HDF5 attributes. Then we can assign that CRS to the raster object.

# Create the projection in as object
myCRS <- spInfo$projdef

## [1] "+proj=utm  +zone=11N +ellps=WGS84 +datum=WGS84"

# define final raster with projection info 
# note that capitalization will throw errors on a MAC.
# if UTM is all caps it might cause an error!
b34r <- raster(b34, 


## class       : RasterLayer 
## dimensions  : 502, 477, 239454  (nrow, ncol, ncell)
## resolution  : 0.002096436, 0.001992032  (x, y)
## extent      : 0, 1, 0, 1  (xmin, xmax, ymin, ymax)
## coord. ref. : +proj=utm +zone=11N +ellps=WGS84 +datum=WGS84 +towgs84=0,0,0 
## data source : in memory
## names       : layer 
## values      : 116, 15677  (min, max)

#let's have a look at our properly positioned raster. Take note of the 
#coordinates on the x and y axis.

      xlab = "UTM Easting", 
      ylab = "UTM Northing",
      main = "Properly Positioned Raster")

# grab resolution of raster as an object
res <- spInfo$xscale

## [1] 1

# Grab the UTM coordinates of the upper left hand corner of the raster

#grab the left side x coordinate (xMin)
xMin <- as.numeric(mapInfo[4]) 
#grab the top corner coordinate (yMax)
yMax <- as.numeric(mapInfo[5])


## [1] 256521


## [1] 4112571

# Calculate the lower right hand corner to define the full extent of the 
# raster. To do this we need the number of columns and rows in the raster
# and the resolution of the raster.

# note that you need to multiple the columns and rows by the resolution of 
# the data to calculate the proper extent!
xMax <- (xMin + (ncol(b34))*res)
yMin <- (yMax - (nrow(b34))*res) 


## [1] 256998


## [1] 4112069

# define the extent (left, right, top, bottom)
rasExt <- extent(xMin,xMax,yMin,yMax)


## class       : Extent 
## xmin        : 256521 
## xmax        : 256998 
## ymin        : 4112069 
## ymax        : 4112571

# assign the spatial extent to the raster
extent(b34r) <- rasExt

# look at raster attributes

## class       : RasterLayer 
## dimensions  : 502, 477, 239454  (nrow, ncol, ncell)
## resolution  : 1, 1  (x, y)
## extent      : 256521, 256998, 4112069, 4112571  (xmin, xmax, ymin, ymax)
## coord. ref. : +proj=utm +zone=11N +ellps=WGS84 +datum=WGS84 +towgs84=0,0,0 
## data source : in memory
## names       : layer 
## values      : 116, 15677  (min, max)

We can adjust the colors of our raster too if we want.

#let's change the colors of our raster and adjust the zlims 

      xlab = "UTM Easting", 
      ylab = "UTM Northing",
      main= "Raster w Custom Colors",

We've now created a raster from band 34 reflectance data. We can export the data as a raster, using the writeRaster command.

#write out the raster as a geotiff


#It's always good practice to close the H5 connection before moving on!
#close the H5 file

Challenge: Work with Rasters

Try these three extensions on your own:

  1. Create rasters using other bands in the dataset.

  2. Vary the distribution of values in the image to mimic an image stretch. e.g. b34[b34 > 6000 ] <- 10000

  3. Use what you know to extract ALL of the reflectance values for ONE pixel rather than for an entire band. HINT: this will require you to pick an x and y value and then all values in the z dimension: aPixel<- h5read(f,"Reflectance",index=list(54,36,NULL)). Plot the spectra output.

Get Lesson Code: 

Dialog content.