---
title: "Working with geospatial data sources of habitat (sub)types and RIBs"
author: "Floris Vanderhaeghe, Toon Westra, Jan Wouters, Cécile Herr"
date: "2021-05-12"
bibliography: references.bib
output: 
  rmarkdown::html_vignette:
    pandoc_args:
        - --csl
        - research-institute-for-nature-and-forest.csl
vignette: >
  %\VignetteIndexEntry{025. Working with geospatial data sources of habitat (sub)types and RIBs}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
options(stringsAsFactors = FALSE)
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  paged.print = FALSE,
  eval = FALSE
)
```

_General note: the below vignette contains frozen output of 12 May 2021._
_This makes it possible to build the package with vignettes without access to the data sources._


```{r eval=TRUE, results='hide', message=FALSE, warning=FALSE}
library(n2khab)
library(sf)
library(dplyr)
library(units)
```

## Intro

This vignette learns you:

- the difference between various geospatial N2KHAB data sources on the occurrence of types in Flanders (types being habitat (sub)types or regionally important biotopes (RIBs); see `read_types()`);
- how to load these data sources into R;
- how to optimally match an interpreted data source with a 'types' column in your own data frame.

## Data sources

The below N2KHAB data sources are available.
For practical information on data storage and locations, see `vignette("v020_datastorage")`.

- **raw data sources**:
    - **`habitatmap`** ([Zenodo-link](https://doi.org/10.5281/zenodo.3354381)): geospatial polygons of BWK and Natura 2000 habitat types in the Flemish Region, originally published by the Research Institute for Nature and Forest (INBO) [@de_saeger_biologische_2020] and distributed by 'Informatie Vlaanderen'.
    - **`habitatstreams`** ([Zenodo-link](https://doi.org/10.5281/zenodo.3386245)): geospatial lines of the Natura 2000 habitat type `3260` that correspond with its presence in streaming water segments in the Flemish Region, originally published by INBO [@leyssen_indicatieve_2020] and distributed by 'Informatie Vlaanderen'.
    - **`habitatsprings`** ([Zenodo-link](https://doi.org/10.5281/zenodo.3550994)): geospatial points hat correspond with the presence or absence of the Natura 2000 habitat type `7220` in springs and streaming water segments in the Flemish Region.
    The data source is produced, owned and administered by INBO.
    - **`habitatquarries`** ([Zenodo-link](https://doi.org/10.5281/zenodo.4072967)): geospatial polygons that correspond with the presence or absence of the Natura 2000 habitat type `8310` in underground marl quarries in the Flemish Region (and border areas).
    The data source is produced, owned and administered by INBO.
- **processed data sources**:
    - **`habitatmap_stdized`** ([Zenodo-link](https://doi.org/10.5281/zenodo.3355192)):
derived from `habitatmap`.
This datasource only contains polygons with habitat (sub)type or RIB codes.
All codes conform to the `types` reference list (see `read_types()`).
The datasource is a GeoPackage with essentially two objects: 1) the geospatial polygons; 2) a long-formatted table with the associated types (multiple records can occur per `polygon_id`).
For more details see `read_habitatmap_stdized()`.
    - **`habitatmap_terr`** ([Zenodo-link](https://doi.org/10.5281/zenodo.3468948)):
the further interpreted, terrestrial part of `habitatmap_stdized` ('terrestrial' includes 'semi-terrestrial').
Amongst other properties, it excludes polygons _without_ terrestrial types, it excludes rows which most probably are no habitat or RIB at all, and several main type codes were translated to a corresponding subtype which they almost always represent.
The datasource is a GeoPackage, organized in the same way as `habitatmap_stdized`.
For more details see `read_habitatmap_terr()`.
    - **`watersurfaces_hab`** ([Zenodo-link](https://doi.org/10.5281/zenodo.3374645)): 
a combination of the environmental data source `watersurfaces` (@leyssen_watervlakken_2020; see `read_watersurfaces()`) and the above data source `habitatmap_stdized`.
It represents polygons (from `watersurfaces` or `habitatmap_stdized`) that completely or partly contain standing water habitat types or RIBs (i.e. `2190_a`, '`31..`' types or `rbbah`; 
types `3260` and `7220` are not included).
The datasource is a GeoPackage, organized in the same way as `habitatmap_stdized`.
For more details see `read_watersurfaces_hab()`.


## Using the `read_...()` functions

In the below R code, it is supposed that a `n2khab_data` folder is present in your working directory or a directory up to 10 levels higher, with the relevant data sources in the right place.
See `vignette("v020_datastorage")` for the setup.

The advantage of using the `read_...()` functions is the uniform approach to read the above data sources in R, enhancing collaboration and speeding up your work.

- Geospatial layers are represented as an `sf` object, in the coordinate reference system (CRS) 'Belge 1972 / Belgian Lambert 72' (EPSG-code [31370](https://epsg.io/31370)).
- Dataframes are presented as tibbles. ^[A tibble is a data frame that makes working in the tidyverse a little [easier](https://r4ds.had.co.nz/tibbles.html).]
- English & [tidyverse-styled](https://style.tidyverse.org/) names of variables stimulate you to always produce internationalized code, tables and figures.
- The `type` variable (if present) is a factor with the same levels as the `type` variable of the `types` reference list (see `read_types()`).
- Some variables may be discarded that should normally have no added value, e.g. values that duplicate already existing information.

### Returning raw data sources

The functions do just basic preprocessing, in order to return (by default) an object that well reflects the structure of the raw data source and returns all its records.

```{r}
habitatmap <- read_habitatmap()
habitatmap
#> Simple feature collection with 646589 features and 30 fields
#> Geometry type: MULTIPOLYGON
#> Dimension:     XY
#> Bounding box:  xmin: 21991.38 ymin: 153058.3 xmax: 258871.8 ymax: 244027.3
#> Projected CRS: Belge 1972 / Belgian Lambert 72
#> # A tibble: 646,589 x 31
#>    polygon_id  eval  eenh1 eenh2 eenh3 eenh4 eenh5 eenh6 eenh7 eenh8 v1    v2
#>  * <chr>       <fct> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#>  1 000098_v20… m     b     <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>
#>  2 000132_v20… m     bl    <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>
#>  3 000135_v20… m     bl    <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>
#>  4 000136_v20… m     bl    <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>
#>  5 000142_v20… m     bl    <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>
#>  6 000150_v20… m     bl    <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>
#>  7 000297_v20… m     bl    <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>
#>  8 000991_v20… m     bl    <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>
#>  9 000999_v20… m     bl    <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>
#> 10 001000_v20… m     bl    <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>
#> # … with 646,579 more rows, and 19 more variables: v3 <chr>, source <chr>,
#> #   info <chr>, bwk_label <chr>, hab1 <chr>, phab1 <int>, hab2 <chr>,
#> #   phab2 <int>, hab3 <chr>, phab3 <int>, hab4 <chr>, phab4 <int>, hab5 <chr>,
#> #   phab5 <int>, source_hab <chr>, source_phab <chr>, hab_legend <fct>,
#> #   area_m2 <dbl>, geometry <MULTIPOLYGON [m]>
```

The `habitatmap` object is a very wide data frame which is not so easy to handle in R.
For analytical work on habitat types and RIBs, you're advised to use the tidied data source `habitatmap_stdized`; see [below](#processed).

`read_habitatmap(filter_hab = TRUE)` will only retain the polygons that occur in `habitatmap_stdized`, i.e. those that contain habitat types or RIBs.

The data sources `habitatstreams`, `habitatsprings` and `habitatquarries` have a straightforward data structure.
There was no need to generate derived data sources.
The meaning of their columns is described in the function documentation.
Not all function arguments are discussed in this vignette: do take the time to look at the documentation!

```{r}
habitatstreams <- read_habitatstreams()
habitatstreams
#> Simple feature collection with 560 features and 3 fields
#> Geometry type: LINESTRING
#> Dimension:     XY
#> Bounding box:  xmin: 33097.92 ymin: 157529.6 xmax: 254039 ymax: 243444.6
#> Projected CRS: Belge 1972 / Belgian Lambert 72
#> # A tibble: 560 x 4
#>    river_name   source_id type                                          geometry
#>    <fct>        <fct>     <fct>                                 <LINESTRING [m]>
#>  1 Wolfputbeek  VMM       3260  (127857.1 167681.2, 127854.9 167684.5, 127844 1…
#>  2 Oude Kale    VMM       3260  (95737.01 196912.9, 95732.82 196912.4, 95710.38…
#>  3 Venloop      EcoInv    3260  (169352.7 209314.9, 169358.8 209290.5, 169326.2…
#>  4 Venloop      EcoInv    3260  (169633.6 209293.5, 169625 209289.2, 169594.4 2…
#>  5 Kleine Nete  EcoInv    3260  (181087.1 208607.2, 181088.6 208608.1, 181089 2…
#>  6 Kleine Nete  EcoInv    3260  (180037.4 208360.4, 180038.3 208377.5, 180038.3…
#>  7 Kleine Nete  EcoInv    3260  (180520 208595.7, 180540.5 208607.4, 180541.2 2…
#>  8 Kleine Nete  EcoInv    3260  (187379.9 209998.8, 187381.3 209998.5, 187381.6…
#>  9 Raamdonkseb… extrapol  3260  (183545.5 192409, 183541.9 192406.7, 183541.9 1…
#> 10 Kleine Nete  EcoInv    3260  (183516.4 208261.7, 183567.3 208279.2, 183567.3…
#> # … with 550 more rows
```

With `read_habitatstreams(source_text = TRUE)` a second object `sources` is returned with the meaning of the `source_id` codes:

```{r}
read_habitatstreams(source_text = TRUE) %>%
  .$sources
#> # A tibble: 7 x 2
#>   source_id          source_text
#>   <fct>              <fct>
#> 1 VMM                "Gegevens afgeleid van macrofyteninventarisaties uitgevoer…
#> 2 EcoInv             "Tijdens ecologische inventarisatiestudies uitgevoerd in o…
#> 3 extrapol           "De conclusie van het nabijgelegen geïnventariseerde segme…
#> 4 Van Belleghem (20… "Macrofytengegevens afgeleid van Van Belleghem S., Bal K.,…
#> 5 Leyssen ea (2005)  "Macrofytengegevens afgeleid van Leyssen A., Adriaens P., …
#> 6 WrnBe              "Waarnemingen afkomstig van Waarnemingen.be, de website vo…
#> 7 INBO               "Tijdens veldbezoeken werd de aan- of afwezigheid van het …
```


```{r}
habitatsprings <- read_habitatsprings()
habitatsprings
#> Simple feature collection with 104 features and 11 fields
#> Geometry type: POINT
#> Dimension:     XY
#> Bounding box:  xmin: 36407.84 ymin: 155249.7 xmax: 258371.1 ymax: 179732
#> Projected CRS: Belge 1972 / Belgian Lambert 72
#> # A tibble: 104 x 12
#>    point_id name       system_type code_orig type  certain unit_id area_m2  year
#>       <dbl> <chr>      <fct>       <chr>     <fct> <lgl>     <dbl>   <dbl> <dbl>
#>  1        1 Steenputb… rivulet     7220      7220  TRUE          1      50  2014
#>  2        2 Steenputb… rivulet     7220      7220  TRUE          1      20  2014
#>  3        3 Duling     <NA>        7230      7230  TRUE         NA      NA  2014
#>  4        4 Kapittelb… <NA>        gh        <NA>  TRUE         NA      NA  2014
#>  5        5 Remersdaa… rivulet     7220      7220  TRUE          2     200  2014
#>  6        6 Remersdaa… rivulet     7220      7220  TRUE          2     500  2014
#>  7        7 Kesterbeek unknown     7220,gh   7220  FALSE        32      NA    NA
#>  8        8 Krindaal   rivulet     7220      7220  TRUE          3     800  2014
#>  9        9 Bois de B… rivulet     7220      7220  TRUE          4      50  2014
#> 10       10 KwintenHo… mire        7220      7220  TRUE          5      10  2014
#> # … with 94 more rows, and 3 more variables: in_sac <lgl>, source <chr>,
#> #   geometry <POINT [m]>
```

Do note that both the `habitatsprings` and `habitatquarries` data sources also contain records which are no habitat type.
You can filter the habitat types by setting the `filter_hab` argument as `TRUE` (not shown).

```{r}
habitatquarries <- read_habitatquarries()
habitatquarries
#> Simple feature collection with 45 features and 6 fields
#> Geometry type: POLYGON
#> Dimension:     XY
#> Bounding box:  xmin: 221427.3 ymin: 160393.5 xmax: 243211.1 ymax: 168965.1
#> Projected CRS: Belge 1972 / Belgian Lambert 72
#> # A tibble: 45 x 7
#>    polygon_id unit_id name        code_orig type  extra_reference
#>         <int>   <int> <chr>       <chr>     <fct> <fct>
#>  1          4       4 Avergat - … gh        <NA>  Lahaye 2018
#>  2          6       6 Avergat - … 8310      8310  Lahaye 2018
#>  3          5       5 Avergat - … gh        <NA>  Lahaye 2018
#>  4         20      20 Coolen      8310      8310  Limburgs Landschap 2020; pers…
#>  5         21      21 Coolen      gh        <NA>  Limburgs Landschap 2020
#>  6         29      29 Groeve Lin… 8310      8310  <NA>
#>  7         31      31 Grote berg… 8310      8310  De Haan & Lahaye 2018
#>  8         37      37 Grote berg… 8310      8310  De Haan & Lahaye 2018
#>  9         24      24 Henisdael … 8310      8310  Dusar et al. 2007
#> 10         34      34 Henisdael … 8310      8310  Dusar et al. 2007
#> # … with 35 more rows, and 1 more variable: geom <POLYGON [m]>
```

The habitatquarries data source also includes the literature references.
These can be added as a second data frame -- in that case a list is returned:

```{r}
habitatquarries2 <- read_habitatquarries(references = TRUE)
class(habitatquarries2)
#> [1] "list"
```


```{r}
names(habitatquarries2)
#> [1] "habitatquarries"  "extra_references"
```


```{r}
all.equal(habitatquarries, habitatquarries2$habitatquarries)
#> [1] TRUE
```


```{r}
habitatquarries2$extra_references
#> # A tibble: 9 x 23
#>   category bibtexkey address author booktitle journal month note  number pages
#>   <chr>    <chr>     <chr>   <chr>  <chr>     <chr>   <chr> <chr> <chr>  <chr>
#> 1 BOOK     de_haan_… Brusse… De Ha… <NA>      <NA>    <NA>  <NA>  <NA>   <NA>
#> 2 INCOLLE… dusar_me… Genk    Dusar… Likona j… <NA>    <NA>  <NA>  <NA>   6-13
#> 3 BOOK     jenneken… Riemst  Jenne… <NA>      <NA>    <NA>  <NA>  <NA>   <NA>
#> 4 INCOLLE… lahaye_g… Riemst  Lahay… <NA>      <NA>    <NA>  <NA>  <NA>   12
#> 5 BOOK     verhoeve… Weert   Verho… <NA>      <NA>    <NA>  <NA>  1769   <NA>
#> 6 BOOK     walschot… <NA>    Walsc… <NA>      <NA>    <NA>  <NA>  <NA>   <NA>
#> 7 MISC     wikipedi… <NA>    {Wiki… <NA>      <NA>    jan   Page… <NA>   <NA>
#> 8 MISC     limburgs… <NA>    {Limb… <NA>      <NA>    apr   Libr… <NA>   <NA>
#> 9 ARTICLE  silverta… <NA>    Silve… <NA>      Natuur… <NA>  <NA>  12     334-…
#> # … with 13 more variables: publisher <chr>, series <chr>, title <chr>,
#> #   volume <chr>, year <dbl>, url <chr>, isbn <chr>, copyright <chr>,
#> #   abstract <chr>, language <chr>, urldate <chr>, issn <chr>, keywords <chr>
```

These extra references can also be printed to the R console in BibTeX format, when specifying `bibtex = TRUE`.

### Returning processed data sources {#processed}

The reading functions for the three processed data sources will always return a **list**:

- the first element holds the geospatial information as an **sf object**.
It has unique feature IDs that can be joined with rows of the second element;
- the second element is a **tibble** that presents attributes of the features as _**[tidy](https://r4ds.had.co.nz/tidy-data.html#tidy-data-1) data**_.

Let's see how this works!

#### read_habitatmap_stdized()

A tidy representation of the `habitatmap` data, restricted to the polygons that contain habitat types or RIBs, is returned by `read_habitatmap_stdized()`:

```{r}
hms <- read_habitatmap_stdized()
```


```{r}
hms_pol <- hms$habitatmap_polygons
hms_pol
#> Simple feature collection with 87781 features and 2 fields
#> Geometry type: MULTIPOLYGON
#> Dimension:     XY
#> Bounding box:  xmin: 22003.2 ymin: 153084.4 xmax: 258871.8 ymax: 243446.1
#> Projected CRS: Belge 1972 / Belgian Lambert 72
#> # A tibble: 87,781 x 3
#>    polygon_id  description_orig                                             geom
#>  * <fct>       <chr>                                          <MULTIPOLYGON [m]>
#>  1 130153_v20… 70% 9120_qb; 30% … (((150669.4 227248.6, 150668.1 227242.9, 1505…
#>  2 130815_v20… 70% gh; 30% 9160   (((258338.4 158696.1, 258336.5 158693.5, 2583…
#>  3 137826_v20… 100% 6430,rbbhf    (((181587.2 234938, 181613.1 234933.1, 181646…
#>  4 170624_v20… 100% rbbmr         (((145876.8 229686.8, 145701.7 229680.6, 1456…
#>  5 203261_v20… 70% gh; 30% rbbmr  (((117137.2 210307.9, 117136.3 210288.3, 1171…
#>  6 204352_v20… 80% gh; 20% rbbsp  (((116357.3 159278.3, 116340.5 159259.8, 1163…
#>  7 204376_v20… 90% gh; 10% rbbsg  (((116110.1 210545.5, 116102.4 210541.1, 1160…
#>  8 205188_v20… 60% rbbsp; 40% gh  (((232114.7 161594.5, 232122.7 161590.5, 2321…
#>  9 205291_v20… 70% gh; 30% rbbsp  (((191253.5 160641.5, 191254.8 160636.1, 1912…
#> 10 205756_v20… 70% gh; 30% rbbsf  (((216258.8 156749, 216260.9 156750, 216287.7…
#> # … with 87,771 more rows
```

```{r}
hms_occ <- hms$habitatmap_types
hms_occ
#> # A tibble: 110,485 x 5
#>    polygon_id   type     certain code_orig  phab
#>    <fct>        <fct>    <lgl>   <chr>     <int>
#>  1 000038_v2016 91E0_va  TRUE    91E0_va     100
#>  2 000043_v2016 9130_end TRUE    9130_end    100
#>  3 000064_v2020 9130_end TRUE    9130_end    100
#>  4 000132_v2016 9130_end TRUE    9130_end    100
#>  5 000204_v2016 91E0_vn  TRUE    91E0_vn     100
#>  6 000255_v2016 91E0_vc  TRUE    91E0_vc     100
#>  7 000297_v2016 rbbsp    TRUE    rbbsp        10
#>  8 000311_v2016 9130_end TRUE    9130_end     70
#>  9 000311_v2016 rbbsp    TRUE    rbbsp        30
#> 10 000390_v2016 91E0_vn  TRUE    91E0_vn      70
#> # … with 110,475 more rows
```

Let's estimate the surface area per type, including uncertain occurrences of types and taking into account cover percentage per polygon (`phab`):

```{r}
hms_pol %>%
  mutate(area = st_area(.)) %>%
  st_drop_geometry() %>%
  inner_join(hms_occ, by = "polygon_id") %>%
  # area of type within polygon:
  mutate(area_type = area * phab / 100) %>%
  group_by(type) %>%
  summarise(area = sum(area_type) %>%
    set_units("ha") %>%
    round(2))
#> # A tibble: 101 x 2
#>    type        area
#>    <fct>       [ha]
#>  1 1130     5678.87
#>  2 1140     2098.03
#>  3 1310_pol   17.39
#>  4 1310_zk    34.02
#>  5 1310_zv    10.62
#>  6 1320        2.41
#>  7 1330_da   117.61
#>  8 1330_hpr  138.12
#>  9 2110       27.14
#> 10 2120      462.54
#> # … with 91 more rows
```

#### read_habitatmap_terr()

`read_habitatmap_terr()` behaves exactly the same way:

```{r}
hmt <- read_habitatmap_terr()
```

```{r}
hmt$habitatmap_terr_polygons
#> Simple feature collection with 78602 features and 4 fields
#> Geometry type: MULTIPOLYGON
#> Dimension:     XY
#> Bounding box:  xmin: 22003.2 ymin: 153084.4 xmax: 258871.8 ymax: 243351.8
#> Projected CRS: Belge 1972 / Belgian Lambert 72
#> # A tibble: 78,602 x 5
#>    polygon_id  description_orig  description source                         geom
#>  * <fct>       <chr>             <chr>       <fct>            <MULTIPOLYGON [m]>
#>  1 130153_v20… 70% 9120_qb; 30%… 70% 9120_q… habita… (((150669.4 227248.6, 1506…
#>  2 130815_v20… 70% gh; 30% 9160  70% gh; 30… habita… (((258338.4 158696.1, 2583…
#>  3 137826_v20… 100% 6430,rbbhf   100% 6430_… habita… (((181587.2 234938, 181613…
#>  4 170624_v20… 100% rbbmr        100% rbbmr  habita… (((145876.8 229686.8, 1457…
#>  5 203261_v20… 70% gh; 30% rbbmr 70% gh; 30… habita… (((117137.2 210307.9, 1171…
#>  6 204352_v20… 80% gh; 20% rbbsp 80% gh; 20… habita… (((116357.3 159278.3, 1163…
#>  7 204376_v20… 90% gh; 10% rbbsg 90% gh; 10… habita… (((116110.1 210545.5, 1161…
#>  8 205188_v20… 60% rbbsp; 40% gh 60% rbbsp;… habita… (((232114.7 161594.5, 2321…
#>  9 205291_v20… 70% gh; 30% rbbsp 70% gh; 30… habita… (((191253.5 160641.5, 1912…
#> 10 205756_v20… 70% gh; 30% rbbsf 70% gh; 30… habita… (((216258.8 156749, 216260…
#> # … with 78,592 more rows
```

```{r}
hmt$habitatmap_terr_types
#> # A tibble: 99,784 x 6
#>    polygon_id   type     certain code_orig  phab source
#>    <fct>        <fct>    <lgl>   <chr>     <int> <fct>
#>  1 000038_v2016 91E0_va  TRUE    91E0_va     100 habitatmap_stdized
#>  2 000043_v2016 9130_end TRUE    9130_end    100 habitatmap_stdized
#>  3 000064_v2020 9130_end TRUE    9130_end    100 habitatmap_stdized
#>  4 000132_v2016 9130_end TRUE    9130_end    100 habitatmap_stdized
#>  5 000204_v2016 91E0_vn  TRUE    91E0_vn     100 habitatmap_stdized
#>  6 000255_v2016 91E0_vc  TRUE    91E0_vc     100 habitatmap_stdized
#>  7 000297_v2016 rbbsp    TRUE    rbbsp        10 habitatmap_stdized
#>  8 000311_v2016 9130_end TRUE    9130_end     70 habitatmap_stdized
#>  9 000311_v2016 rbbsp    TRUE    rbbsp        30 habitatmap_stdized
#> 10 000390_v2016 91E0_vn  TRUE    91E0_vn      70 habitatmap_stdized
#> # … with 99,774 more rows
```

Compared to `habitatmap_stdized`, _purely_ aquatic or non-habitat/RIB polygons were omitted, and a part of the type data were interpreted in a more specific way.
Further, while type `7220` is present in `habitatmap_terr`, the `read_habitatmap_terr()` function drops it by default because the `habitatsprings` data source is recommended for that.
This can be controlled by the `drop_7220` argument.

As a consequence, some type codes are completely absent from `habitatmap_terr_types`:

```{r}
hms_occ %>%
  distinct(type) %>%
  anti_join(
    hmt$habitatmap_terr_types %>%
      distinct(type),
    by = "type"
  ) %>%
  arrange(type)
#> # A tibble: 7 x 1
#>   type
#>   <fct>
#> 1 2190
#> 2 6410
#> 3 6430
#> 4 6510
#> 5 7140
#> 6 7220
#> 7 9130
```

About 3% of all type occurrences received a new type code:

```{r}
hmt$habitatmap_terr_types %>%
  count(source) %>%
  mutate(pct = (n / sum(n) * 100) %>% round(0))
#> # A tibble: 2 x 3
#>   source                                  n   pct
#>   <fct>                               <int> <dbl>
#> 1 habitatmap_stdized                  96424    97
#> 2 habitatmap_stdized + interpretation  3360     3
```

#### read_watersurfaces_hab()

A similar story, this time for polygons that (could) have aquatic types:

```{r}
wsh <- read_watersurfaces_hab()
```

```{r}
wsh$watersurfaces_polygons
#> Simple feature collection with 3233 features and 4 fields
#> Geometry type: MULTIPOLYGON
#> Dimension:     XY
#> Bounding box:  xmin: 22546.57 ymin: 159273.1 xmax: 253896.9 ymax: 242960.1
#> Projected CRS: Belge 1972 / Belgian Lambert 72
#> # A tibble: 3,233 x 5
#>    polygon_id polygon_id_ws polygon_id_habitatm… description_orig
#>  * <fct>      <fct>         <fct>                <chr>
#>  1 ANTANT0082 ANTANT0082    596466_v2014         60% 3150; 20% rbbmr; 20% rbbsf
#>  2 ANTANT0234 ANTANT0234    633396_v2020         100% 3130_na
#>  3 ANTANT0251 ANTANT0251    113978_v2014         100% 3150
#>  4 ANTANT0253 ANTANT0253    111606_v2014         100% 3150
#>  5 ANTANT0297 ANTANT0297    409153_v2014+409153… 85% 3140; 15% 3150+85% 3140; 1…
#>  6 ANTANT0315 ANTANT0315    519082_v2018         100% 3140
#>  7 ANTANT0319 ANTANT0319    601958_v2014         100% 3150,gh
#>  8 ANTANT0381 ANTANT0381    644003_v2014         85% gh; 15% 3140
#>  9 ANTANT0383 ANTANT0383    631879_v2014+593522… 50% 3150; 40% rbbmr; 10% rbbsf…
#> 10 ANTANT0384 ANTANT0384    644003_v2014         85% gh; 15% 3140
#> # … with 3,223 more rows, and 1 more variable: geom <MULTIPOLYGON [m]>
```

```{r}
wsh$watersurfaces_types
#> # A tibble: 3,669 x 4
#>    polygon_id type    certain code_orig
#>    <fct>      <fct>   <lgl>   <chr>
#>  1 ANTANT0082 3150    TRUE    3150
#>  2 ANTANT0234 3130_na TRUE    3130_na
#>  3 ANTANT0251 3150    TRUE    3150
#>  4 ANTANT0253 3150    TRUE    3150
#>  5 ANTANT0297 3140    TRUE    3140
#>  6 ANTANT0297 3150    TRUE    3150
#>  7 ANTANT0315 3140    TRUE    3140
#>  8 ANTANT0319 3150    FALSE   3150,gh
#>  9 ANTANT0381 3140    TRUE    3140
#> 10 ANTANT0383 3150    TRUE    3150
#> # … with 3,659 more rows
```

Let's compute some statistics of standing water types (ignoring the value `certain`):

```{r}
wsh$watersurfaces_polygons %>%
  mutate(area = st_area(.)) %>%
  st_drop_geometry() %>%
  inner_join(wsh$watersurfaces_types,
    by = "polygon_id"
  ) %>%
  group_by(type) %>%
  summarise(
    nr_watersurfaces = n_distinct(polygon_id),
    total_area = sum(area),
    area_min = min(area),
    area_Q1 = quantile(area, 0.25),
    area_Q2 = quantile(area, 0.5),
    area_Q3 = quantile(area, 0.75),
    max = max(area)
  ) %>%
  mutate_at(
    vars(matches("area|max")),
    function(x) {
      set_units(x, "a") %>% round(1)
    }
  )
#> # A tibble: 9 x 8
#>   type     nr_watersurfaces total_area area_min area_Q1 area_Q2 area_Q3     max
#>   <fct>               <int>        [a]      [a]     [a]     [a]     [a]     [a]
#> 1 2190_a                318     3605.6      0.0     0.9     2.1     5.5   317.5
#> 2 3110                    6     1876.8    111.7   139.7   216.9   330.3   849.3
#> 3 3130                  219    33022.9      0.6    13.0    41.0   160.0  2807.2
#> 4 3130_aom             1267   107732.6      0.1     2.3     9.7    44.1  7062.7
#> 5 3130_na               354   120174.2      1.0    44.7   122.5   277.9 13668.2
#> 6 3140                  104    72272.0      0.8    13.3    55.3   418.5 13668.2
#> 7 3150                  475   105019.1      0.4     6.6    17.3    83.5 13668.2
#> 8 3160                  379    27936.9      0.1     6.1    17.4    59.3  4179.1
#> 9 rbbah                 517    33716.3      0.1     4.2    10.9    47.0  3748.6
```

A strong point of `sf` objects is that the geometry has explicit units.
Consequently we can make use of tools like `set_units()` to convert units (e.g. surface area as _are_ (a)).

Because the main type code `3130` will mostly boil down to `3130_aom` in the field, a further interpreted flavour can be generated with `read_watersurfaces_hab(interpreted = TRUE)`.

```{r}
read_watersurfaces_hab(interpreted = TRUE) %>%
  .$watersurfaces_types %>%
  filter(type == "3130") %>%
  nrow()
#> [1] 0
```


## Matching interpreted data sources with a data frame column of types

The `expand_types()` function helps us to join a type column of your own data frame with the results of `read_habitatmap_terr()` or `read_watersurfaces_hab()`.

For both of these datasets, the following conversions of your data frame are supported in order to create optimal joins:

- adding extra _subtype_ rows when main type codes occur for which subtypes exist;
- adding extra _main type_ rows when certain subtypes of that main type are present (the conditions can be relaxed; see the function documentation).
This is supported for 2330, 6230 and 91E0, for which this makes sense with the two mentioned data sources.

An example (`df` is our data frame):

```{r eval=TRUE}
df <-
  tribble(
    ~mycode, ~obs,
    "2130", 5,
    "2190", 45,
    "2330_bu", 8,
    "2330_dw", 8,
    "6410_mo", 78,
    "6410_ve", 4,
    "91E0_vn", 10
  )
df
```

With the `type_var` argument you specify which variable of your data frame represents type codes:

```{r eval=TRUE}
df_exp <-
  expand_types(df, type_var = "mycode")
df_exp
```

More examples and features are explained in the documentation of `expand_types()`.

Obviously, more rows of `habitatmap_terr` will be retained by joining `df_exp`:

```{r warning=FALSE}
hmt$habitatmap_terr_types %>%
  semi_join(df_exp, by = c(type = "mycode")) %>%
  nrow()
#> [1] 6634
```

When joining with `df`:

```{r warning=FALSE}
hmt$habitatmap_terr_types %>%
  semi_join(df, by = c(type = "mycode")) %>%
  nrow()
#> [1] 4984
```


## References