Skip to main content
NSF NEON | Open Data to Understand our Ecosystems logo

Main navigation

  • About Us
    • Overview
      • Spatial and Temporal Design
      • History
    • Vision and Management
    • Advisory Groups
      • Science, Technology & Education Advisory Committee
      • Technical Working Groups (TWGs)
    • FAQ
    • Contact Us
      • Field Offices
    • User Accounts
    • Staff

    About Us

  • Data & Samples
    • Data Portal
      • Explore Data Products
      • Data Availability Charts
      • Spatial Data & Maps
      • Document Library
      • API & GraphQL
      • Prototype Data
      • External Lab Data Ingest (restricted)
    • Samples & Specimens
      • Discover and Use NEON Samples
        • Sample Types
        • Sample Repositories
        • Sample Explorer
        • Megapit and Distributed Initial Characterization Soil Archives
        • Excess Samples
      • Sample Processing
      • Sample Quality
      • Taxonomic Lists
    • Collection Methods
      • Protocols & Standardized Methods
      • Airborne Remote Sensing
        • Flight Box Design
        • Flight Schedules and Coverage
        • Daily Flight Reports
          • AOP Flight Report Sign Up
        • Camera
        • Imaging Spectrometer
        • Lidar
      • Automated Instruments
        • Site Level Sampling Design
        • Sensor Collection Frequency
        • Instrumented Collection Types
          • Meteorology
          • Phenocams
          • Soil Sensors
          • Ground Water
          • Surface Water
      • Observational Sampling
        • Site Level Sampling Design
        • Sampling Schedules
        • Observation Types
          • Aquatic Organisms
            • Aquatic Microbes
            • Fish
            • Macroinvertebrates & Zooplankton
            • Periphyton, Phytoplankton, and Aquatic Plants
          • Terrestrial Organisms
            • Birds
            • Ground Beetles
            • Mosquitoes
            • Small Mammals
            • Soil Microbes
            • Terrestrial Plants
            • Ticks
          • Hydrology & Geomorphology
            • Discharge
            • Geomorphology
          • Biogeochemistry
          • DNA Sequences
          • Pathogens
          • Sediments
          • Soils
            • Soil Descriptions
    • Data Notifications
    • Data Guidelines and Policies
      • Acknowledging and Citing NEON
      • Publishing Research Outputs
      • Usage Policies
    • Data Management
      • Data Availability
      • Data Formats and Conventions
      • Data Processing
      • Data Quality
      • Data Product Revisions and Releases
        • Release 2021
        • Release 2022
        • Release 2023
      • NEON and Google
      • Externally Hosted Data

    Data & Samples

  • Field Sites
    • About Field Sites and Domains
    • Explore Field Sites
    • Site Management Data Product

    Field Sites

  • Impact
    • Observatory Blog
    • Case Studies
    • Spotlights
    • Papers & Publications
    • Newsroom
      • NEON in the News
      • Newsletter Archive
      • Newsletter Sign Up

    Impact

  • Resources
    • Getting Started with NEON Data & Resources
    • Documents and Communication Resources
      • Papers & Publications
      • Document Library
      • Outreach Materials
    • Code Hub
      • Code Resources Guidelines
      • Code Resources Submission
      • NEON's GitHub Organization Homepage
    • Learning Hub
      • Science Videos
      • Tutorials
      • Workshops & Courses
      • Teaching Modules
      • Faculty Mentoring Networks
      • Data Education Fellows
    • Research Support and Assignable Assets
      • Field Site Coordination
      • Letters of Support
      • Mobile Deployment Platforms
      • Permits and Permissions
      • AOP Flight Campaigns
      • Excess Samples
      • Assignable Assets FAQs
    • Funding Opportunities

    Resources

  • Get Involved
    • Advisory Groups
      • Science, Technology & Education Advisory Committee
      • Technical Working Groups
    • Upcoming Events
    • Past Events
    • NEON Ambassador Program
    • Collaborative Works
      • EFI-NEON Ecological Forecasting Challenge
      • NCAR-NEON-Community Collaborations
      • NEON Science Summit
      • NEON Great Lakes User Group
    • Community Engagement
    • Science Seminars and Data Skills Webinars
    • Work Opportunities
      • Careers
      • Seasonal Fieldwork
      • Postdoctoral Fellows
      • Internships
        • Intern Alumni
    • Partners

    Get Involved

  • My Account
  • Search

Search

Learning Hub

  • Science Videos
  • Tutorials
  • Workshops & Courses
  • Teaching Modules
  • Faculty Mentoring Networks
  • Data Education Fellows

Breadcrumb

  1. Resources
  2. Learning Hub
  3. Tutorials
  4. Unsupervised Spectral Classification in Python: KMeans & PCA

Tutorial

Unsupervised Spectral Classification in Python: KMeans & PCA

Authors: Bridget Hass

Last Updated: Apr 1, 2021

In this tutorial, we will use the Spectral Python (SPy) package to run KMeans and Principal Component Analysis unsupervised classification algorithms.

Objectives

After completing this tutorial, you will be able to:

  • Classify spectral remote sensing data.

Install Python Packages

  • numpy
  • gdal
  • matplotlib
  • matplotlib.pyplot

Download Data

This tutorial uses a 1km AOP Hyperspectral Reflectance 'tile' from the SERC site. Download the spectral classification teaching data subset here.

Download Dataset

In this tutorial, we will use the Spectral Python (SPy) package to run KMeans and Principal Component Analysis unsupervised classification algorithms.

To learn more about the Spcral Python packages read:

  • Spectral Python User Guide.
  • Spectral Python Unsupervised Classification.

KMeans Clustering

KMeans is an iterative clustering algorithm used to classify unsupervised data (eg. data without a training set) into a specified number of groups. The algorithm begins with an initial set of randomly determined cluster centers. Each pixel in the image is then assigned to the nearest cluster center (using distance in N-space as the distance metric) and each cluster center is then re-computed as the centroid of all pixels assigned to the cluster. This process repeats until a desired stopping criterion is reached (e.g. max number of iterations).

Read more on KMeans clustering from Spectral Python.

To visualize how the algorithm works, it's easier look at a 2D data set. In the example below, watch how the cluster centers shift with progressive iterations,

KMeans clustering demonstration Source: Sandipan Deyn

Principal Component Analysis (PCA) - Dimensionality Reduction

Many of the bands within hyperspectral images are often strongly correlated. The principal components transformation represents a linear transformation of the original image bands to a set of new, uncorrelated features. These new features correspond to the eigenvectors of the image covariance matrix, where the associated eigenvalue represents the variance in the direction of the eigenvector. A very large percentage of the image variance can be captured in a relatively small number of principal components (compared to the original number of bands).

Read more about PCA with Spectral Python.

Set up

To run this notebook, the following Python packages need to be installed. You can install required packages from command line pip install spectra scikit-learn cvxopt.

or if already in a Jupyter Notebook, run the following code in a Notebook code cell.

Packages:

  • pylab
  • spectral
  • scikit-learn (optional)
import sys
!{sys.executable} -m pip install spectral
!conda install --yes --prefix {sys.prefix} scikit-learn
!conda install --yes --prefix {sys.prefix} cvxopt 

In order to make use of the interactive graphics capabilities of spectralpython, such as N-Dimensional Feature Display, you work in a Python 3.6 environment (as of July 2018).

For more, read from Spectral Python.

Optional:

matplotlib wx backend (for 3-D visualization of PCA, requires Python 3.6) Find out more on StackOverflow.

conda install -c newville wxpython-phoenix

Managing Conda Environments

  • nb_conda_kernels package provides a separate jupyter kernel for each conda environment
  • Find out more on Conda docs.
conda install -c conda-forge nb_conda_kernels

First, import the required packages and set display preferences:

from spectral import *
import spectral.io.envi as envi
import numpy as np
import matplotlib

#for clean output, to not print warnings, don't use when developing script
import warnings
warnings.filterwarnings('ignore')

For this example, we will read in a reflectance tile in ENVI format. NEON provides an h5 plugin for ENVI

# You will need to download the example dataset above,
# extract the files therein,
# and update the filepaths below per your local machine
img = envi.open('/Users/olearyd/Git/data/NEON_D02_SERC_DP3_368000_4306000_reflectance.hdr',
                '/Users/olearyd/Git/data/NEON_D02_SERC_DP3_368000_4306000_reflectance.dat')

Note that the information is stored differently when read in with envi.open. We can find the wavelength information in img.bands.centers. Let's take a look at the first and last wavelengths values:

print('First 3 Band Center Wavelengths:',img.bands.centers[:3])
print('Last 3 Band Center Wavelengths:',img.bands.centers[-3:])
First 3 Band Center Wavelengths: [383.534302, 388.542206, 393.55011]
Last 3 Band Center Wavelengths: [2501.878906, 2506.886719, 2511.894531]

We'll set the Water Vapor Band windows to NaN:

img.bands.centers[191:211]==np.nan
img.bands.centers[281:314]==np.nan
img.bands.centers[-10:]==np.nan
False

To get a quick look at the img data, use the params method:

img.params
<bound method SpyFile.params of 	Data Source:   '/Users/olearyd/Git/data/NEON_D02_SERC_DP3_368000_4306000_reflectance.dat'
	# Rows:           1000
	# Samples:        1000
	# Bands:           426
	Interleave:        BIP
	Quantization:  16 bits
	Data format:     int16>

Metadata information is stored in img.metadata, a dictionary. Let's look at the metadata contents:

md = img.metadata
print('Metadata Contents:')
for item in md:
    print('\t',item)
Metadata Contents:
	 description
	 samples
	 lines
	 bands
	 data type
	 interleave
	 file type
	 header offset
	 byte order
	 map info
	 coordinate system string
	 wavelength
	 fwhm
	 wavelength units
	 reflectance scale factor
	 data ignore value
	 dataset names

To access any of these metadata items, use the syntax md['description'] or md['map info']:

print('description:',md['description'])
print('map info:',md['map info'])
description: Atmospherically corrected reflectance.
map info: ['UTM', '1.000', '1.000', '368000.000', '4307000.000', '1.000000e+000', '1.000000e+000', '18', 'North', 'WGS-84', 'units=Meters']

You can also use type and len to look at the type and length (or number) of some of the metadata contents:

print(type(md['wavelength']))
print('Number of Bands:',len(md['wavelength']))
<class 'list'>
Number of Bands: 426

Let's look at the data using imshow, a wrapper around matplotlib's imshow for multi-band images:

view = imshow(img,bands=(58,34,19),stretch=0.05,title="RGB Image of 2017 SERC Tile")
print(view)
ImageView object:
  Display bands       :  (58, 34, 19)
  Interpolation       :  <default>
  RGB data limits     :
    R: [0.0058, 0.1471]
    G: [0.0184, 0.133]
    B: [0.0086, 0.1099]

png

When dealing with NEON hyperspectral data, we first want to remove the water vapor & noisy bands, keeping only the valid bands. To speed up the classification algorithms for demonstration purposes, we'll look at a subset of the data using read_subimage, a built in method to subset by area and bands. Type help(img.read_subimage) to see how it works.

valid_band_range = [i for j in (range(0,191), range(212, 281), range(315,415)) for i in j] #remove water vapor bands
img_subset = img.read_subimage(range(400,600),range(400,600),bands=valid_band_range) #subset image by area and bands

Plot the subsetted image for reference:

view = imshow(img_subset,bands=(58,34,19),stretch=0.01,title="RGB Image of 2017 SERC Tile Subset")

png

Now that we have the image subsetted, lets run the k-means algorithm. Type help(kmeans) to show how the function works. To run the k-means algorithm on the image and create 5 clusters, using a maximum of 50 iterations, use the following syntax:

(m,c) = kmeans(img_subset,5,50) 
spectral:INFO: k-means iteration 1 - 27267 pixels reassigned.
spectral:INFO: k-means iteration 2 - 3969 pixels reassigned.
spectral:INFO: k-means iteration 3 - 1690 pixels reassigned.
spectral:INFO: k-means iteration 4 - 1141 pixels reassigned.
spectral:INFO: k-means iteration 5 - 811 pixels reassigned.
spectral:INFO: k-means iteration 6 - 440 pixels reassigned.
spectral:INFO: k-means iteration 7 - 236 pixels reassigned.
spectral:INFO: k-means iteration 8 - 143 pixels reassigned.
spectral:INFO: k-means iteration 9 - 87 pixels reassigned.
spectral:INFO: k-means iteration 10 - 46 pixels reassigned.
spectral:INFO: k-means iteration 11 - 20 pixels reassigned.
spectral:INFO: k-means iteration 12 - 5 pixels reassigned.
spectral:INFO: k-means iteration 13 - 2 pixels reassigned.
spectral:INFO: k-means iteration 14 - 1 pixels reassigned.
spectral:INFO: k-means iteration 15 - 0 pixels reassigned.
spectral:INFO: kmeans terminated with 5 clusters after 14 iterations.

Note that the algorithm terminated afte 14 iterations, when the pixels stopped being reassigned.

Data Tip: You can iterrupt the algorithm with a keyboard interrupt (CTRL-C) if you notice that the number of reassigned pixels drops off. Kmeans catches the KeyboardInterrupt exception and returns the clusters generated at the end of the previous iteration. If you are running the algorithm interactively, this feature allows you to set the max number of iterations to an arbitrarily high number and then stop the algorithm when the clusters have converged to an acceptable level. If you happen to set the max number of iterations too small (many pixels are still migrating at the end of the final iteration), you can simply call kmeans again to resume processing by passing the cluster centers generated by the previous call as the optional start_clusters argument to the function.

Let's take a look at the cluster centers c. In this case, these represent spectras of the five clusters of reflectance that the data were grouped into.

print(c.shape)
(5, 360)

c contains 5 groups of spectral curves with 360 bands (the # of bands we've kept after removing the water vapor windows and the last 10 noisy bands). Let's plot these spectral classes:

%matplotlib inline
import pylab
pylab.figure()
for i in range(c.shape[0]):
    pylab.plot(c[i])
pylab.show
pylab.title('Spectral Classes from K-Means Clustering')
pylab.xlabel('Bands (with Water Vapor Windows Removed)')
pylab.ylabel('Reflectance')
Text(0, 0.5, 'Reflectance')

png

#%matplotlib notebook
view = imshow(img_subset, bands=(58,34,19),stretch=0.01, classes=m)
view.set_display_mode('overlay')
view.class_alpha = 0.5 #set transparency
view.show_data
<bound method ImageView.show_data of ImageView object:
  Display bands       :  (58, 34, 19)
  Interpolation       :  <default>
  RGB data limits     :
    R: [0.0055, 0.0617]
    G: [0.0184, 0.0893]
    B: [0.0083, 0.0503]
>

png

Challenges: K-Means

  1. What do you think the spectral classes in the figure you just created represent?
  2. Try using a different number of clusters in the kmeans algorithm (e.g., 3 or 10) to see what spectral classes and classifications result.

Principal Component Analysis (PCA)

Many of the bands within hyperspectral images are often strongly correlated. The principal components transformation represents a linear transformation of the original image bands to a set of new, uncorrelated features. These new features correspond to the eigenvectors of the image covariance matrix, where the associated eigenvalue represents the variance in the direction of the eigenvector. A very large percentage of the image variance can be captured in a relatively small number of principal components (compared to the original number of bands) .

pc = principal_components(img_subset)
pc_view = imshow(pc.cov)
xdata = pc.transform(img_subset)

png

In the covariance matrix display, lighter values indicate strong positive covariance, darker values indicate strong negative covariance, and grey values indicate covariance near zero.

pcdata = pc.reduce(num=10).transform(img_subset)
pc_0999 = pc.reduce(fraction=0.999)

# How many eigenvalues are left?
print(len(pc_0999.eigenvalues))

img_pc = pc_0999.transform(img_subset)
print(img_pc.shape)

v = imshow(img_pc[:,:,:5], stretch_all=True)
5
(200, 200, 5)

png

Get Lesson Code

classification_kmeans_pca_py.ipynb

Questions?

If you have questions or comments on this content, please contact us.

Contact Us
NEON Logo

Follow Us:

Join Our Newsletter

Get updates on events, opportunities, and how NEON is being used today.

Subscribe Now

Footer

  • My Account
  • About Us
  • Newsroom
  • Contact Us
  • Terms & Conditions
  • Careers

Copyright © Battelle, 2019-2020

The National Ecological Observatory Network is a major facility fully funded by the National Science Foundation.

Any opinions, findings and conclusions or recommendations expressed in this material do not necessarily reflect the views of the National Science Foundation.