1 Introduction

This manual introduces the Habitat Sampler (HaSa), an innovative tool that autonomously generates representative reference samples for predictive modeling of surface class probabilities. The tool can be applied to any image data that displays surface structures and dynamics of any kind at multiple spatial and temporal scales. HaSa was initially developed to classify habitat dynamics in semi-natural ecosystems, but the procedure can in principle be applied to any surface. The main innovation of the tool is that it reduces the reliance on comprehensive in situ ground truth or training data sets, a requirement that constrains accurate image classification, particularly in complex scenes.

Though development of HaSa has prioritized ease of use, this documentation assumes familiarity with the R software. The document builds successively and is intended to lead you step by step through the HaSa procedure for generating probability and classification maps. HaSa is still in development, and any suggestions or improvements are welcomed and encouraged in our GitLab Community Version. If questions remain, please don’t hesitate to contact the authors of the package. For a detailed description of the Habitat Sampler and its applications, see Neumann et al. (2020).

1.1 Usage

The tool is implemented in R and uses Leaflet (Cheng et al., 2019) to generate interactive maps in a web browser. There are no assumptions about the input image data, and there are no constraints on the spectral-temporal-spatial domain in which the image is sampled. The tool requires the input of a priori expert user knowledge to generate reference data for the expected surface classes, which are either delineated in the imagery or extracted from an external spectral library. The user has the choice between two image classifiers: random forest (RF) and support vector machine (SVM).
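The two classifier options correspond to the standard R implementations in the randomForest and e1071 packages, both of which HaSa loads as dependencies. The following is a generic sketch of the two model families on toy data, independent of HaSa's internal calls; the data and object names are purely illustrative:

```r
library(randomForest)
library(e1071)

# toy reference data: 2 classes, 3 spectral predictors (illustrative only)
set.seed(1)
x <- data.frame(b1 = rnorm(20), b2 = rnorm(20), b3 = rnorm(20))
y <- factor(rep(c("class_a", "class_b"), each = 10))

rf_mod  <- randomForest(x, y, ntree = 500)  # corresponds to model = "rf"
svm_mod <- svm(x, y)                        # corresponds to model = "svm"
```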

1.2 Sample datasets

The examples in this documentation use an L2A Sentinel-2 timeseries stack from 2018 (6 dates, 9 bands each) and reference points that are included in the HaSa package. These Sentinel-2 data are from the Kyritz-Ruppiner Heide, a former military training area northeast of Berlin, Germany. The open heathlands in the former military training area are designated protected areas under the European Natura 2000 network and are subject to management activities including tree removal, controlled burning and machine mowing. The reference data include 7 classes identified with a priori expert knowledge.

The Sentinel-2 data are downloaded and processed using the German Centre for Geosciences (GFZ) Time Series System for Sentinel-2 (GTS2). Data are made available and atmospherically corrected via a simple web API. Detailed information on the GTS2 system can be found on the GFZ website. The metadata of the Sentinel-2 data, including the band IDs in the timeseries stack, are provided below (Table 1).

Table 1: Sentinel-2 bands included in the timeseries stack and the corresponding band IDs in the HaSa tool

Date         Band 2   Band 3   Band 4   Band 5       Band 6       Band 7       Band 8   Band 11   Band 12
             Blue     Green    Red      Red Edge 1   Red Edge 2   Red Edge 3   NIR      SWIR 1    SWIR 2
2018-03-03   1        2        3        4            5            6            7        8         9
2018-05-07   10       11       12       13           14           15           16       17        18
2018-06-06   19       20       21       22           23           24           25       26        27
2018-07-31   28       29       30       31           32           33           34       35        36
2018-09-09   37       38       39       40           41           42           43       44        45
2018-10-14   46       47       48       49           50           51           52       53        54

Example reference data are provided in two formats: a data table and a point shapefile. The table includes spectral information for each class type desired for the classification and can be used directly to train the models (rows = classes, columns = spectral bands). The first row must contain the spectral band names, and these must match the band names of the input image data.
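As an illustration of the expected layout (a sketch only; the demo file itself is loaded later via HaSa::load_reference_as_table), a conforming table could be built in R like this, with row names as classes and column names matching the band names of the input stack. The values shown are taken from the demo reference data (Table 2), trimmed to 2 of the 7 classes and 3 of the 54 bands for brevity:

```r
# miniature reference table: rows = classes, columns = spectral band names
ref_demo <- data.frame(
    SentinelStack_2018.1 = c(1066, 656),
    SentinelStack_2018.2 = c(1069, 687),
    SentinelStack_2018.3 = c(915, 444),
    row.names = c("deciduous", "coniferous")
)
```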

The point shapefile contains one point location per class and is used to extract the reference data. In the event the shapefile does not have the same projection as the input image data, it will be automatically reprojected. The class names must be passed as a vector in the same order as the reference spectra (rows = classes).

2 HaSa installation

The following procedure will lead you through the preliminary steps required to set up the HaSa tool.

2.1 HaSa dependencies

The installation assumes that R is installed on the user’s system, as well as certain system package dependencies. If this is not the case, the user should install the following packages (we only provide instructions for Debian-based systems, such as Ubuntu):

    # Install dependencies
    apt-get install -y protobuf-compiler libprotobuf-dev proj-bin gdal-bin libgdal-dev 

    # Install dev-dependencies
    apt-get install -y jq libjq-dev libv8-dev pandoc libsodium-dev libsecret-1-dev \
    libfreetype6-dev fontconfig libfontconfig1-dev r-cran-lubridate qpdf ghostscript

    # Install R
    apt-get install -y r-base

2.2 Install HaSa

HaSa is not yet available as a CRAN package, so the user needs to install it directly from its repository. Since the versions of certain packages are pinned (cf. Section 2.1), the user should pass upgrade = FALSE when calling remotes::install_git. The following example installs the HaSa R package and builds its manual and vignettes.

library(remotes)
remotes::install_git("https://git.gfz-potsdam.de/habitat-sampler/HabitatSampler.git",
    ref = "master", subdir = "R-package/hasa", dependencies = NA, upgrade = FALSE,
    build = TRUE, build_manual = TRUE, build_vignettes = TRUE)

2.2.1 Fast installation

The installation might take several minutes to complete. To speed up the process, the user can skip building the documentation and vignettes and install dependencies concurrently (threads = <num_of_cores - 1>):

library(remotes)
remotes::install_git("https://git.gfz-potsdam.de/habitat-sampler/HabitatSampler.git",
    ref = "master", subdir = "R-package/hasa", dependencies = NA, upgrade = FALSE,
    build = TRUE, build_manual = FALSE, build_vignettes = FALSE, quick = TRUE,
    threads = 6)

2.3 Load HaSa

Before the user starts using HaSa it is necessary to load the library and some of its dependencies. In the following example, the libraries to be loaded are passed as a vector. The option options(rgdal_show_exportToProj4_warnings = "none") suppresses the warning messages related to the latest changes in GDAL and PROJ6.

options(rgdal_show_exportToProj4_warnings = "none")
libraries <- c("rgdal", "raster", "maptools", "randomForest", "e1071", "devtools",
    "fasterize", "rgeos", "leaflet", "htmlwidgets", "IRdisplay", "HaSa")
lapply(libraries, library, character.only = TRUE)

3 Load demo data

An important step preceding classification is to load the Sentinel-2 satellite timeseries stack, reference data, and class names. HaSa provides a set of functions to guide the user through the data loading process.

3.1 Data directories

Before loading the input data and using HaSa, the user needs to define a series of directory paths, from which HaSa will read the input data and to which it will store intermediate and final results. These directory paths are relative to the working directory path, i.e., wd. The following code sets all the paths assuming that the root path is the current directory, i.e., the demo directory.

wd <- tools::file_path_as_absolute("./")

# demo data stored here
dataPath <- paste(wd, "/data", sep = "")

# output directory
out_path <- paste(wd, "/Results/", sep = "")

# temporary directory for the raster package
raster::rasterOptions(tmpdir = "./RasterTmp/")

3.2 Satellite timeseries stack

The satellite time series is passed either as a stack of images that is already clipped (Section 3.2.1) or as a stack of images to be clipped (Section 3.2.2). In both cases, the input satellite images need to either have a valid projection or have the projection passed as a parameter, e.g., sat_crs_str = '+proj=utm +zone=32 +datum=WGS84 +units=m +no_defs'; otherwise, the function will report an error. Satellite time series data are available in dataPath.

satellite_series_path <- paste(dataPath, "/SentinelStack_2018.tif", sep = "")
timeseries_stack <- HaSa::load_timeseries_stack(satellite_series_path)

# Mask out NaNs in all layers
timeseries_stack <- raster::mask(timeseries_stack, raster::calc(timeseries_stack,
    fun = sum))
# See Table 1 for band IDs and corresponding Sentinel-2 bands
r = 19
g = 20
b = 21
raster::plotRGB(timeseries_stack, r = r, g = g, b = b, stretch = "lin", axes = TRUE)

3.3 Selecting reference samples

Reference samples include spectral information for the classes defined by the user. Reference data can be passed either as a table (Section 3.3.1) or as a shapefile (Section 3.3.2). Reference samples passed as a table can be either pre-extracted spectral information from an image or spectral information imported from a spectral library. Reference samples passed as a shapefile are point locations defined by the user on the satellite timeseries stack, from which the spectral information is automatically extracted for each class. Sample reference data are available in dataPath in table and shapefile format.

3.3.1 Reference table

The table includes spectral information for each class type (rows = classes, columns = spectral reflectance). The first row must contain the spectral waveband names, matching those used in the input satellite time series stack (Table 2).

# load reference table. Each row represents a unique target class.
table_data_path <- paste(dataPath, "/Example_Reference_table.txt", sep = "")
ref <- HaSa::load_reference_as_table(table_data_path)
Table 2: Reference data

Class           SentinelStack_2018.1   SentinelStack_2018.2   SentinelStack_2018.3   …   SentinelStack_2018.54
deciduous       1066                   1069                   915                    …   1725
coniferous      656                    687                    444                    …   1671
heather_young   2071                   2303                   2227                   …   2726
heather_old     895                    910                    728                    …   1413
heather_shrub   889                    927                    792                    …   1718
bare_ground     2176                   2335                   2277                   …   2139
xeric_grass     3952                   4566                   4757                   …   4893

3.3.2 Reference points

The point shapefile contains one point location per class and is used to automatically extract the reference data. The spectral values for each point are extracted from the Sentinel-2 timeseries stack using the R routine raster::extract. If the shapefile does not have the same projection as the input Sentinel-2 stack, HaSa will automatically reproject it to match the projection of the Sentinel-2 stack. The resulting table has the following format (rows = classes, columns = spectral wavebands).

shp_path <- paste(dataPath,"/Example_Reference_Points.shp", sep = "")
satellite_series_path <- paste(dataPath,"/SentinelStack_2018.tif", sep = "")

ref <- HaSa::load_reference_as_shape(
    shp_path, satellite_series_path,
    shp_crs_str = '+proj=utm +zone=33 +datum=WGS84 +units=m +no_defs'
  )

3.3.3 Define class names

The class names should be passed as a vector in the same order as the reference spectra (rows = class).

# create vector with class names. The order of classes must follow the
# same order of reference spectra (row = class)
class_names <- c("deciduous", "coniferous", "heather_young", "heather_old",
    "heather_shrub", "bare_ground", "xeric_grass")

4 Generating outputs

4.1 Calculating class probability

4.1.1 Plot configuration

The interactive mode of HaSa requires the user’s expertise to define a threshold for class extraction. With the help of an interactive map the user chooses the threshold used to filter out the pixels not belonging to a certain class. The interactive map includes an RGB composite of one of the Sentinel-2 scenes to assist in class extraction.

The satellite timeseries stack (SentinelStack_2018.tif) loaded in Section 3.2.1 has 6 scenes, and each scene includes 9 bands (see Table 1). Using the clipped Sentinel-2 timeseries stack provided as input, the user can test which bands should be used in the plot using HaSa::plot_configuration(). The variable plot_rgb will later be used for the interactive procedure of class sampling.

shp_path <- paste(dataPath,"/Example_Reference_Points.shp", sep = "")
satellite_series_path <- paste(dataPath,"/SentinelStack_2018.tif",sep = "")
plot_rgb <- c("r" = 19, "g" = 20, "b" = 21)

HaSa::plot_configuration(
    shp_path, satellite_series_path,
    plot_rgb,
    shp_crs_str = '+proj=utm +zone=33 +datum=WGS84 +units=m +no_defs'
  )

4.1.2 Define color palette for probability plot

In the following palette, light grey indicates a low probability and forest green a high probability.

col <- colorRampPalette(c("lightgrey", "orange", "yellow", "limegreen", "forestgreen"))

4.1.3 Class Sampling

Class sampling input

HaSa::multi_Class_Sampling(
    in_raster = raster_stk,   # clipped satellite time series stack [raster brick]
    num_samples = 75,         # starting number of spatial samples (recommended value: 75)
                              # for more info *See note 1
    sample_type =
        "regular_raster",     # distribution of spatial samples
                              # ("random_raster", "regular_raster", "matrix_random")
                              # recommended: "regular_raster" *See note 2
    num_models = 200,         # number of models to collect (recommended value: 200)
    num_iterations = 10,      # number of iterations for model accuracy
                              # (recommended value: 10)
    buffer = 15,              # distance (in m) for new sample collection around initial
                              # samples (depends on pixel size and image resolution)
                              # (e.g. Sentinel: pixel size = 10 m; sqrt(10²+10²) = 14.142 => 15 m)
    reference = ref,          # table of reference spectra [data.frame]
    model = "rf",             # which machine learning algorithm to use ("rf" random
                              # forest or "svm" support vector machine;
                              # recommended input: "rf")
    num_trees = 500,          # if the model is "rf", the number of trees for the
                              # random forest algorithm; otherwise the parameter is
                              # ignored *See note 3
    mtry = 8,                 # splitting nodes (recommended: mtry < number of predictors;
                              # formula: mtry = sqrt(number of layers in input raster);
                              # e.g. demo data: layers = 54; sqrt(54) = 7.35 => 8)
    mod_error = 0.02,         # threshold for model error; iteration is repeated until
                              # the error falls below this value
    last = FALSE,             # only TRUE for a one-class classifier
                              # recommended input: FALSE *See note 4
    last_ref_val = 500,       # default reference value for the last step (default: 500)
                              # for more info *See note 5
    seed =
      as.integer(Sys.time()), # set seed for reproducible results *See note 6
    init_seed = "sample",     # "sample" for new or use Run@seeds to reproduce previous
                              # steps *See note 7
    out_path = out_path,      # output path for saving results
    step = 1,                 # at which step the procedure should start, e.g. use
                              # step = 2 if the first habitat is already extracted
    class_names = class_names,# vector with class names in the order of reference spectra
    n_classes = 7,            # total number of classes to be separated
    multi_test = 1,           # number of test runs to compare different probability
                              # outputs *See note 8
    RGB = c(19, 20, 21),      # band IDs for the RGB composite in the interactive plots
    color =
        c("lightgrey", "orange", "yellow", "limegreen", "forestgreen"),
                              # single colors for continuous color palette interpolation
    in_memory = TRUE,         # boolean for raster processing (in memory = TRUE,
                              # from disk = FALSE)
    optimized_mode = TRUE,    # use the optimized mode (run in memory if possible
                              # and use matrices instead of rasters)
    overwrite = TRUE,         # overwrite the KML and raster files from previous runs
    save_runs = TRUE,         # a class object is saved to disk for each run
                              # (default TRUE)
    plot_on_browser = FALSE   # plot in the browser or inline in a notebook (default TRUE)
    )
  • Note 1: In case no models can be found, increasing num_samples and num_models is not always the solution. The user should also try re-sampling with a different seed value.
  • Note 2: There are three sampling strategies: random_raster (uses raster::sampleRandom), regular_raster (uses raster::sampleRegular), and matrix_random (uses matrices and the stats::sample function over non-NaN pixels only). regular_raster is fast and preferable at the beginning of the sampling procedure. random_raster is slow; it only samples pixels with information and is preferable at the final steps, when few, irregularly distributed pixels remain. matrix_random is fast; it also only samples pixels with information and is preferable at the final steps.
  • Note 3: The default value is 500. For a small number of trees (at least 1/3 of the number of predictors), use an odd number for precise prediction results.
  • Note 4: The argument last = TRUE can be set when only one class should be separated from the background pixels.
  • Note 5: The reference data for the pseudo class in the last step are built using the value of last_ref_val. In case the user gets NA as the last class, the user should adjust the value of last_ref_val and re-sample.
  • Note 6: For different results the seed should have a different value on each run (use seed = as.integer(Sys.time())). To repeat a specific run, the user just needs to restart R and run HaSa::multi_Class_Sampling again with the same arguments, that is, keep seed constant, and it will produce the same results for all steps. The exception is random_raster (which uses raster::sampleRandom): it will always produce slightly different results even when a seed is provided. This behaviour comes from the raster package and cannot be changed.
  • Note 7: The results from previous steps are reproducible (again, only with regular_raster and matrix_random) when using the same seed value and init_seed = Run@seeds. Consequently, the initial seed with regular sampling determines an invariant sample distribution; use random sampling or vary the initial seed to get varying sample distributions.
  • Note 8: If multi_test > 1, the user will get multiple maps and will be asked to enter the number of the probability distribution that is most appropriate.
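Notes 6 and 7 can be condensed into a short sketch. The file name of the saved object is illustrative and depends on what HaSa wrote into out_path during the first run; the resulting values would be passed as seed = my_seed and init_seed = init to HaSa::multi_Class_Sampling:

```r
# First run: fixed seed so the sampling can be reproduced later
my_seed <- 12345          # any constant integer; change it to obtain new samples
init    <- "sample"       # fresh sampling

# Later, to reproduce the earlier steps (regular_raster / matrix_random only):
# restart R, reload the saved HabitatSampler object, and reuse its seeds
# load(paste(out_path, "Run", sep = ""))  # hypothetical file name of the R binary
# my_seed <- 12345                        # same seed as before
# init    <- Run@seeds                    # per-step seeds from the previous run
```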

Habitat sampling output: an R object that contains the following slots.

        models          # selected classifiers
        ref_samples     # spatial points of selected samples (see WriteOutSamples.r)
        switch          # if switch is not NA, the target class must be changed
                        # from [1] to [2] (see WriteOutSamples.r)
        layer           # raster layer of habitat type probability
        mod_all         # all classifiers from num_models
        class_ind       # predictive distance metric for all classes
        seeds           # seeds to reproduce respective step/habitat type sampling

4.1.4 Interactive probability maps and downloading output

An interactive map is plotted in a web browser (e.g., Firefox for Linux) containing a selected habitat type. The number of models predicting this habitat type can be viewed by hovering the mouse over the map.

From this interactive map, the user has two choices:

  1. Extract the class type on the basis of a threshold (the minimum number of models assigning a pixel to that class type), if the probability map meets the user’s evaluation.

If the user chooses to extract the class, a user-defined threshold is entered into the R console and the following files are saved:

  • HabitatSampler object (Run) - R Binary: The R object is used when the user wants to restart the computation at a specific step or reuse the seeds for sampling.
  • probability map - .kml, .png, geocoded *.tif: The TIFF contains all classes plotted, one color per class. See the example in demo/Data/Results/HabitatMap_final.pdf
  • threshold list - R Binary
  • Rerun object - R Binary: For every step a rerun object is saved containing adapted class names, reference values, and a mask to crop the input raster, so that if a classification fails the user can continue at that point
  • leaflet interactive web interface - *.html: Leaflet map with the 3 RGB channels and the raster containing the probabilities. The file is re-written for each run

After class extraction is done, the user can proceed automatically to the next class by entering 0 into the R console.

  2. Sample the class type again to produce a better probability map. If, after visual inspection, the user is not satisfied, they can adjust the number of samples and models. It is recommended to first increase the number of models (+50).

If the user chooses to adjust the starting number of samples and the number of models, they must enter the sample/model adjustment (../..) into the R console and evaluate the interactive probability maps again.

The evaluation, thresholding or re-running, and export process is repeated for each class.

Note: If convergence is not possible, i.e., no models can be selected, num_samples is insufficient, or another error occurs, it is possible to restart by re-running the multi_Class_Sampling function with the following new argument values:

    num_samples    # printed in console
    num_models     # printed in console
    step           # at which step should the procedure start, e.g. use 
                   # step = 2 if the first habitat is already extracted 

4.2 Generating classification map and summary statistics

Generate a classification map based on the saved output of the thresholded probability maps:

# in_path: input files path (equals out_path)
# color: vector of plot colors, one for each class type
plot_results(in_path, color)

Note: If color is not specified, an internal color palette is used. To set colors manually, create a vector with 7 colors corresponding to each class (see the R color cheatsheet for a detailed list of colors and tips).
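For example, a manually defined palette might look like this. The color choices are hypothetical, with one color per class in the same order as the class_names vector, and in_path is assumed to equal the out_path defined earlier:

```r
# hypothetical colors, one per class in the order of class_names
color <- c("darkgreen", "darkolivegreen", "purple", "violet",
           "pink", "khaki", "yellow")
# then pass it to the plotting routine:
# plot_results(out_path, color)
```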