This manual introduces the Habitat Sampler (HaSa), an innovative tool that autonomously generates representative reference samples for predictive modeling of surface class probabilities. The tool can be applied to any image data that displays surface structures and dynamics of any kind at multiple spatial and temporal scales. HaSa was initially developed to classify habitat dynamics in semi-natural ecosystems, but the procedure can theoretically be applied to any surface. The main innovation of the tool is that it reduces reliance on comprehensive in situ ground truth data or comprehensive training data sets, which constrain accurate image classification, particularly in complex scenes.
Though development of HaSa has prioritized ease of use, this documentation assumes familiarity with the R software. The document builds successively and is intended to lead you step by step through the HaSa procedure of generating probability and classification maps. HaSa is still in development, and any suggestions or improvements are welcomed and encouraged in our GitLab Community Version. If questions remain, please don't hesitate to contact the authors of the package. For a detailed description of the Habitat Sampler and its applications, see Neumann et al. (2020).
The tool is implemented in R and uses Leaflet (Cheng et al., 2019) to generate interactive maps in a web browser. There are no assumptions about the input image data and there are no constraints on the spectral-temporal-spatial domain in which the image is sampled. The tool requires the input of a priori expert user knowledge to generate reference data about expected surface classes, which are delineated in the imagery or extracted from an external spectral library. The user has the choice between the image classifiers random forest (RF) and support vector machine (SVM).
The examples in this documentation use an L2A Sentinel-2 timeseries stack from 2018 (6 dates, 9 bands each) and reference points that are included in the HaSa package. These Sentinel-2 data are from the Kyritz-Ruppiner Heide, a former military training area northeast of Berlin, Germany. The open heathlands in the former military training area are designated protected areas under the European Natura 2000 network and are subject to management activities including tree removal, controlled burning, and machine mowing. The reference data include 7 classes identified with a priori expert knowledge.
The Sentinel-2 data are downloaded and processed using the German Centre for Geosciences (GFZ) Time Series System for Sentinel-2 (GTS2). Data are made available and atmospherically corrected via a simple web API. Detailed information on the GTS2 system can be found here. The metadata of the Sentinel-2 data including the band ID in the timeseries stack are provided below (Table 1).
Date | Band 2 (Blue) | Band 3 (Green) | Band 4 (Red) | Band 5 (Red Edge 1) | Band 6 (Red Edge 2) | Band 7 (Red Edge 3) | Band 8 (NIR) | Band 11 (SWIR 1) | Band 12 (SWIR 2)
---|---|---|---|---|---|---|---|---|---
2018-03-03 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
2018-05-07 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 |
2018-06-06 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 |
2018-07-31 | 28 | 29 | 30 | 31 | 32 | 33 | 34 | 35 | 36 |
2018-09-09 | 37 | 38 | 39 | 40 | 41 | 42 | 43 | 44 | 45 |
2018-10-14 | 46 | 47 | 48 | 49 | 50 | 51 | 52 | 53 | 54 |
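The layer IDs in Table 1 follow a regular layout: scenes are stacked in acquisition-date order with nine bands each, so a layer index can be computed rather than looked up. A small helper sketch (`band_id` is our illustrative function, not part of HaSa):

```r
# Layer index in the timeseries stack: scenes are stacked chronologically,
# nine bands per scene (Bands 2-8, 11, 12).
band_id <- function(scene, band_pos, bands_per_scene = 9) {
  (scene - 1) * bands_per_scene + band_pos
}

band_id(3, 3)  # 3rd scene (2018-06-06), 3rd band (Red) -> layer 21
band_id(6, 9)  # 6th scene (2018-10-14), SWIR 2 -> layer 54
```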
Example reference data are provided in two formats, as a data table and a point shapefile. The table includes spectral information for each class type desired for the classification and can be used directly to train the models (rows = class, columns = spectral bands). The first row must contain the spectral band names, and these must match the bands of the input image data.
The point shapefile contains a point location per class and is used to extract the reference data. In the event the shapefile does not have the same projection as the input image data, it will be automatically reprojected. The class names must be passed as a vector in the same order as the reference spectra (rows = class).
The following procedure will lead you through the preliminary steps required to setup the HaSa tool.
The installation assumes that R is already installed on the user's system, as well as certain system package dependencies. If that is not the case, the user should install the following packages (we only provide instructions for a Debian-based OS, such as Ubuntu):
```sh
# Install dependencies
apt-get install -y protobuf-compiler libprotobuf-dev proj-bin gdal-bin libgdal-dev

# Install dev-dependencies
apt-get install -y jq libjq-dev libv8-dev pandoc libsodium-dev libsecret-1-dev \
  fontconfig libfontconfig1-dev r-cran-lubridate qpdf ghostscript libfreetype6-dev

# Install R
apt-get install -y r-base
```
HaSa is not yet available as a CRAN package. The user needs to install it directly from its repository. Since the versions of certain packages were pinned (cf. Section 2.1), the user should use `upgrade = FALSE` when calling `remotes::install_git`. In the following example the user can install the HaSa R package and build its manual and vignettes.
```r
library(remotes)
remotes::install_git("https://git.gfz-potsdam.de/habitat-sampler/HabitatSampler.git",
                     ref = "master", subdir = "R-package/hasa", dependencies = NA,
                     upgrade = FALSE, build = TRUE, build_manual = TRUE,
                     build_vignettes = TRUE)
```
The installation might take several minutes to conclude. To speed up the process, the user can skip building the documentation and vignettes and install dependencies concurrently (`threads = <num_of_cores - 1>`):
```r
library(remotes)
remotes::install_git("https://git.gfz-potsdam.de/habitat-sampler/HabitatSampler.git",
                     ref = "master", subdir = "R-package/hasa", dependencies = NA,
                     upgrade = FALSE, build = TRUE, build_manual = FALSE,
                     build_vignettes = FALSE, quick = TRUE, threads = 6)
```
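Instead of hard-coding `threads = 6`, the value `<num_of_cores - 1>` can be derived at runtime with the base `parallel` package; a sketch:

```r
# Determine <number of cores - 1>, falling back to 1 if the core
# count cannot be detected (parallel::detectCores() may return NA).
cores <- parallel::detectCores()
n_threads <- if (is.na(cores)) 1L else max(1L, cores - 1L)
n_threads
```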
Before using HaSa, it is necessary to load the library and some of its dependencies. In the following example, the libraries to be loaded are passed as a list. The option `options("rgdal_show_exportToProj4_warnings" = "none")` suppresses the warning messages related to the latest changes in `gdal` and `PROJ6`.
```r
options(rgdal_show_exportToProj4_warnings = "none")
libraries <- c("rgdal", "raster", "maptools", "randomForest", "e1071", "devtools",
               "fasterize", "rgeos", "leaflet", "htmlwidgets", "IRdisplay", "HaSa")
lapply(libraries, library, character.only = TRUE)
```
An important step preceding classification is to load the Sentinel-2 satellite timeseries stack, reference data, and class names. HaSa provides a set of functions to guide the user through the data loading process.
Before loading the input data and using HaSa, the user needs to define a series of directory paths. These are where HaSa will read input data and store intermediate and final results. The directory paths are relative to the working directory path, i.e., `wd`. The following code sets all the paths assuming that the root path is the current directory, i.e., the `demo` directory.
```r
wd <- tools::file_path_as_absolute("./")

# demo data stored here
dataPath <- paste(wd, "/data", sep = "")

# output directory
out_path <- paste(wd, "/Results/", sep = "")

# temporary directory for the raster package
raster::rasterOptions(tmpdir = "./RasterTmp/")
```
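The snippet above only builds the path strings; if the `Results/` and `RasterTmp/` directories do not yet exist, creating them up front avoids write errors on the first run. A defensive sketch (our suggestion; HaSa itself may create some of these):

```r
# Build the paths as above, then create the directories if missing.
# recursive = TRUE also creates missing parents; showWarnings = FALSE
# keeps repeated runs quiet.
wd <- getwd()
out_path <- paste(wd, "/Results/", sep = "")
raster_tmp <- "./RasterTmp/"
for (p in c(out_path, raster_tmp)) {
  dir.create(p, showWarnings = FALSE, recursive = TRUE)
}
```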
The satellite time series is passed either as a stack of images already clipped (3.2.1) or as a stack of images to be clipped (3.2.2). In both cases, the input satellite images need either to have a valid projection or to have the projection passed as a parameter, i.e., `sat_crs_str = '+proj=utm +zone=32 +datum=WGS84 +units=m +no_defs'`; otherwise, the function will report an error. Satellite time series data are available in `dataPath`.
```r
satellite_series_path <- paste(dataPath, "/SentinelStack_2018.tif", sep = "")
timeseries_stack <- HaSa::load_timeseries_stack(satellite_series_path)

# Mask out NaNs in all layers
timeseries_stack <- raster::mask(timeseries_stack,
                                 raster::calc(timeseries_stack, fun = sum))

# See Table 1 for band IDs and corresponding Sentinel-2 bands
r <- 19
g <- 20
b <- 21
raster::plotRGB(timeseries_stack, r = r, g = g, b = b, stretch = "lin", axes = TRUE)
```
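The mask step works because `raster::calc(…, fun = sum)` yields `NA` wherever any layer is `NA`, so masking by the per-pixel sum removes pixels that are incomplete in any scene. The same logic in base R (a toy sketch, not using the raster package):

```r
# Toy illustration of the NaN mask logic: two "layers" of a
# four-pixel image; pixel 3 is missing in layer 2.
layer1 <- c(10, 20, 30, 40)
layer2 <- c(1, 2, NA, 4)

# The per-pixel sum is NA wherever any layer is NA ...
layer_sum <- layer1 + layer2

# ... so masking a layer by the sum drops exactly the incomplete pixels.
masked <- ifelse(is.na(layer_sum), NA, layer1)
masked  # 10 20 NA 40
```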
Reference samples include spectral information for the classes defined by the user. Reference data can be passed either as a table (3.3.1) or as a shapefile (3.3.2). Reference samples passed as a table can either be pre-extracted spectral information from an image or spectral information imported from a spectral library. Reference samples passed as a shapefile are point locations defined by the user on the satellite timeseries stack, from which the spectral information is automatically extracted for each class. Sample reference data are available in `dataPath` in table and shapefile format.
The table includes spectral information for each class type (rows = classes, columns = spectral reflectance). The first row must contain the spectral waveband names, matching those used in the input satellite time series stack (Table 2).
```r
# load reference table. Each row represents a unique target class.
table_data_path <- paste(dataPath, "/Example_Reference_table.txt", sep = "")
ref <- HaSa::load_reference_as_table(table_data_path)
```
Class | SentinelStack_2018.1 | SentinelStack_2018.2 | SentinelStack_2018.3 | … | SentinelStack_2018.54
---|---|---|---|---|---
deciduous | 1066 | 1069 | 915 | … | 1725 |
coniferous | 656 | 687 | 444 | … | 1671 |
heather_young | 2071 | 2303 | 2227 | … | 2726 |
heather_old | 895 | 910 | 728 | … | 1413 |
heather_shrub | 889 | 927 | 792 | … | 1718 |
bare_ground | 2176 | 2335 | 2277 | … | 2139 |
xeric_grass | 3952 | 4566 | 4757 | … | 4893 |
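Because the reference column names must match the stack layer names exactly, a quick check before training can save a failed run. A sketch with mock stand-ins for the loaded objects (in practice, `ref` and the stack names come from the loading steps above):

```r
# Mock layer names as they appear in the demo stack (illustration only).
stack_names <- paste0("SentinelStack_2018.", 1:54)

# Mock reference table: 7 classes (rows) x 54 bands (columns).
ref <- as.data.frame(matrix(0, nrow = 7, ncol = 54,
                            dimnames = list(NULL, stack_names)))

# Names must match exactly and in the same order.
names_match <- identical(colnames(ref), stack_names)
names_match
```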
The point shapefile contains one point location per class and is used to automatically extract the reference data. The spectral values for each point are extracted from the Sentinel-2 timeseries stack using the R routine `raster::extract`. If the shapefile does not have the same projection as the input Sentinel-2 stack, HaSa will automatically reproject it to match the projection of the Sentinel-2 stack. The resulting table has the following format (rows = classes, columns = spectral wavebands).
```r
shp_path <- paste(dataPath, "/Example_Reference_Points.shp", sep = "")
satellite_series_path <- paste(dataPath, "/SentinelStack_2018.tif", sep = "")

ref <- HaSa::load_reference_as_shape(
  shp_path, satellite_series_path,
  shp_crs_str = '+proj=utm +zone=33 +datum=WGS84 +units=m +no_defs'
)
```
The class names should be passed as a vector in the same order as the reference spectra (rows = class).
```r
# create vector with class names. The order of classes must follow the
# same order of reference spectra (row = class)
class_names <- c("deciduous", "coniferous", "heather_young", "heather_old",
                 "heather_shrub", "bare_ground", "xeric_grass")
```
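Since classes are matched to reference spectra purely by position, a simple sanity check (our suggestion, not a HaSa requirement) is that the name vector has one unique entry per reference row:

```r
class_names <- c("deciduous", "coniferous", "heather_young", "heather_old",
                 "heather_shrub", "bare_ground", "xeric_grass")

# Stand-in for nrow(ref): one reference spectrum per class.
n_ref_rows <- 7

# One unique name per reference spectrum, in matching order.
stopifnot(length(class_names) == n_ref_rows,
          !anyDuplicated(class_names))
```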
The interactive mode of HaSa requires the user's expertise to define a threshold for class extraction. With the help of an interactive map, the user chooses the threshold used to filter out the pixels not belonging to a certain class. The interactive map includes an RGB composite of one of the Sentinel-2 scenes to assist in class extraction.
The satellite timeseries stack (`SentinelStack_2018.tif`) loaded in 3.2.1 has 6 scenes, and each scene includes 9 bands (see Table 1). Using the clipped Sentinel-2 timeseries stack provided as input, the user can test which bands should be used in the plot using `HaSa::plot_configuration()`. The variable `plot_rgb` will later be used for the interactive procedure of class sampling.
```r
shp_path <- paste(dataPath, "/Example_Reference_Points.shp", sep = "")
satellite_series_path <- paste(dataPath, "/SentinelStack_2018.tif", sep = "")
plot_rgb <- c("r" = 19, "g" = 20, "b" = 21)

HaSa::plot_configuration(
  shp_path, satellite_series_path, plot_rgb,
  shp_crs_str = '+proj=utm +zone=33 +datum=WGS84 +units=m +no_defs'
)
```
Light grey indicates a low probability and forest green a high probability.
```r
col <- colorRampPalette(c("lightgrey", "orange", "yellow", "limegreen", "forestgreen"))
```
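Note that `colorRampPalette` returns a function rather than a vector of colors: calling `col(n)` interpolates `n` hex colors between the five anchors, so the probability legend can use any number of steps. For example:

```r
col <- colorRampPalette(c("lightgrey", "orange", "yellow",
                          "limegreen", "forestgreen"))

pal5 <- col(5)      # the five anchor colors themselves, as hex codes
pal100 <- col(100)  # 100 steps for a smooth probability legend
```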
Class sampling input:

```r
HaSa::multi_Class_Sampling(
  in_raster = raster_stk,       # clipped satellite time series stack [raster brick]
  num_samples = 75,             # starting number of spatial samples (recommended
                                # value: 75); for more info *See note 1
  sample_type = "regular_raster",  # distribution of spatial samples
                                # ("random_raster", "regular_raster", "matrix_random")
                                # recommended: "regular_raster" *See note 2
  num_models = 200,             # number of models to collect (recommended value: 200)
  num_iterations = 10,          # number of iterations for model accuracy
                                # (recommended value: 10)
  buffer = 15,                  # distance (in m) for new sample collection around
                                # initial samples (depends on pixel size and image
                                # resolution; e.g. Sentinel: pixel size = 10 m;
                                # sqrt(10^2 + 10^2) = 14.142 => 15 m)
  reference = ref,              # table of reference spectra [data.frame]
  model = "rf",                 # which machine learning algorithm to use ("rf" random
                                # forest or "svm" support vector machine;
                                # recommended input: "rf")
  num_trees = 500,              # if the model is "rf", the number of trees for the
                                # random forest algorithm; otherwise the parameter is
                                # ignored *See note 3
  mtry = 8,                     # splitting nodes (recommended: mtry < number of
                                # predictors; formula: mtry = sqrt(number of layers in
                                # input raster); e.g. demo data: layers = 54;
                                # sqrt(54) = 7.35 => 8)
  mod_error = 0.02,             # threshold for model error until which the iteration
                                # is repeated
  last = FALSE,                 # only TRUE for a one-class classifier
                                # recommended input: FALSE *See note 4
  last_ref_val = 500,           # default reference value for the last step
                                # (default: 500); for more info *See note 5
  seed = as.integer(Sys.time()),  # set seed for reproducible results *See note 6
  init_seed = "sample",         # "sample" for new or use Run@seeds to reproduce
                                # previous steps *See note 7
  out_path = out_path,          # output path for saving results
  step = 1,                     # at which step the procedure should start, e.g. use
                                # step = 2 if the first habitat is already extracted
  class_names = class_names,    # vector with class names in the order of reference
                                # spectra
  n_classes = 7,                # total number of classes to be separated
  multi_test = 1,               # number of test runs to compare different probability
                                # outputs *See note 8
  RGB = c(19, 20, 21),          # bands for the RGB composite in the interactive plots
  color = c("lightgrey", "orange", "yellow", "limegreen", "forestgreen"),
                                # single colors for continuous color palette
                                # interpolation
  in_memory = TRUE,             # boolean for raster processing (in memory = TRUE,
                                # from disk = FALSE)
  optimized_mode = TRUE,        # use the optimized mode (run in memory if possible
                                # and use matrices instead of rasters)
  overwrite = TRUE,             # overwrite the KML and raster files from previous runs
  save_runs = TRUE,             # a class object is saved to disk for each run
                                # (default TRUE)
  plot_on_browser = FALSE       # plot in the browser or inline in a notebook
                                # (default TRUE)
)
```
Notes:

- Note 1: Increasing `num_samples` and `num_models` is not always the solution. The user should also try to re-sample with a different `seed` value.
- Note 2: The available sample types are `random_raster` (uses `raster::sampleRandom`), `regular_raster` (uses `raster::sampleRegular`), and `matrix_random` (uses matrices and the `stats::sample` function over only the existing non-`NaN` pixels). `regular_raster` is fast and preferable at the beginning of the sampling procedure. `random_raster` is slow: it only samples pixels with information and is preferable at the final steps, when few, irregularly distributed pixels remain. `matrix_random` is fast: it only samples pixels with information and is likewise preferable at the final steps.
- Note 4: `last = TRUE` can be set when only one class should be separated from the background pixels.
- Note 5: In case the user gets NA as the last class, the user should adjust the value of `last_ref_val` and re-sample again.
- Note 6: `seed` should have a different value on each run (use `seed = as.numeric(Sys.time())`). To repeat a specific run, the user just needs to restart R and run `HaSa::multi_Class_Sampling` again with the same arguments, that is, keep `seed` constant, and it will produce the same results for all the steps. The exception is `random_raster` (it uses `raster::sampleRandom`): it will always produce slightly different results even with a seed provided. This behaviour comes from the raster package and cannot be changed.
- Note 7: Runs are reproducible for `regular_raster` and `matrix_random` when using the same seed value and `init_seed = Run@seeds` (e.g. `Run02@seeds`). In consequence, the initial sample for regular sampling determines an invariant sample distribution; use random sampling or vary the initial samples to get varying sample distributions.
- Note 8: With `multi_test > 1` the user will get multiple maps and will be asked to enter the number of the probability distribution that is appropriate.

Habitat sampling output: an R object that contains:
```r
models       # selected classifiers
ref_samples  # spatial points of selected samples (see WriteOutSamples.r)
switch       # the target class is [2]; if switch is not NA then the target
             # class must be changed from [1] to [2] (see WriteOutSamples.r)
layer        # raster layer of habitat type probability
mod_all      # all classifiers from num_models
class_ind    # predictive distance metric for all classes
seeds        # seeds to reproduce the respective step/habitat type sampling
```
An interactive map is plotted in a web browser (e.g., Firefox for Linux) containing a selected habitat type. The number of models predicting this habitat type can be viewed by hovering the mouse over the map.
From this interactive map, the user has two choices:

- If the user chooses to extract the class, a user-defined threshold is entered into the R console and the output files are saved. After class extraction is done, the user can proceed automatically to the next class by entering 0 into the R console.
- If the user chooses to adjust the starting number of samples and the number of models, the user must enter the sample/model adjustment (../..) into the R console and evaluate the interactive probability maps again.
The evaluation, thresholding or re-running, and export process is repeated for each class.
Note: If convergence is not possible, i.e., no models can be selected, `num_samples` is not enough, or another error occurs, it is possible to restart by re-running the `multi_Class_Sampling` function with the following new argument values:

```r
num_samples  # printed in console
num_models   # printed in console
step         # at which step the procedure should start, e.g. use
             # step = 2 if the first habitat is already extracted
```
Generate a classification map based on the saved output of the threshold probability maps:

```r
# in_path: input files path (equals out_path)
# color: vector of plot colors, one for each class type
plot_results(in_path, color)
```
Note: If `color` is not specified, an internal color palette will be used. To set colors manually, create a vector with 7 colors corresponding to each class (see the R color cheatsheet for a detailed list of colors and tips).
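As an illustration, a hand-picked vector (the color choices are ours, not prescribed by HaSa) with one entry per class in the order of `class_names`:

```r
# One plot color per class, same order as class_names
# (illustrative choices only).
color <- c("darkgreen",  # deciduous
           "darkcyan",   # coniferous
           "magenta",    # heather_young
           "purple",     # heather_old
           "plum",       # heather_shrub
           "khaki",      # bare_ground
           "gold")       # xeric_grass
stopifnot(length(color) == 7)
```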