Where to store the N2KHAB input data sources

Introduction

For a practical recipe to setup n2khab_data, go to Getting started.

See vignette("v022_example") if you’d like to be guided by a hands-on example.

Distribution of the data sources used by n2khab functions

Apart from several textual datasets, provided directly with this package, other N2KHAB data sources 1 are binary or large data. Those are made available through cloud-based infrastructure, preserved for the future at least via Zenodo (see below).2 An overview of data distribution pathways is given here.

More about Zenodo

Zenodo is a scientific repository funded by the European Commission and hosted at CERN.

  • We prefer Zenodo for its straightforward, easy approach of preserving data sources for the long term – needed for reproducibility – while providing a stable DOI link for each version and for each record as whole.
  • Managing the N2KHAB data sources at Zenodo allowed us to apply a uniform and pure representation of each data source: we made one data version correspond to one data set following one fileformat (not zipped); we add no other files (e.g. metadata files like pdf’s, alternative fileformats etc.). The filenames in these Zenodo records follow the codes of the data sources that are used in the n2khab package.
  • Zenodo storage nicely fits with an internationalized approach of reproducible N2KHAB workflows in R, as its website is in English.

Local data storage

Data sources evolve, and hence, data source versions succeed one another. To ease reproducibility of analytical workflows, this package assumes locally stored data sources.

The n2khab functions, aimed at reading these data and returning them in R in some kind of standardized way, always provide arguments to specify the file’s name and location – so you can in fact freely choose these. However, to ease collaboration in scripting, it is highly recommended to follow the below standard locations and filenames (see: Getting started). Moreover, the functions assume these conventions by default in order to make your life easier!

There is a major distinction between:

  • raw data (Zenodo-link), to be stored in a directory n2khab_data/10_raw;
  • processed data (Zenodo-link), to be stored in a directory n2khab_data/20_processed. These data sources have been derived from the raw data sources, but are distributed on their own because of the time-consuming or intricate calculations needed to reproduce them.

You can reproduce the processed data sources from a shell script on Github, but it will take hours.

These binary or large data sources are to be stored in a dedicated directory n2khab_data on your system. Don’t use this special directory for adding other data. It can reside inside one project or repository but it can also deliver to several projects / repositories; see further. n2khab_data should always be ignored by version control systems.

Getting started for your (collaborative) workflow

Mind that, if you store the n2khab_data directory inside a version controlled repository (e.g. using git), it must be ignored by version control!

  1. Decide where you want to store the n2khab_data directory:

    • from the viewpoint of several projects / several git repositories, when these need the same data source versions, the location may be at a high level in your file system. A convenient approach is to use the directory which holds the different project directories / repositories.
    • from the viewpoint of one project / repository: the n2khab_data directory can be put inside the project / repository directory. This approach has the advantage that you can store versions of data sources different from those in another repository (where you also have an n2khab_data directory).

    For the functions to succeed in finding the n2khab_data directory in each collaborator’s file system, make sure that the directory is present either in the working directory of your R scripts or in a path at some level above this working directory. By default, the functions search the directory in that order and use the first encountered n2khab_data directory. Alternatively, you can set an environment variable N2KHAB_DATA_PATH or option n2khab_data_path to enforce a specific directory on your system that all n2khab functions will use (do that outside the files you collaborate on and share; see n2khab_options()).

  2. From your working directory, use fileman_folders() to specify the desired location (using the function’s arguments). It will check the existence of the directories n2khab_data, n2khab_data/10_raw and n2khab_data/20_processed and create them if they don’t exist.

fileman_folders(root = "rproj")
#> Created <clipped_path_prefix>/n2khab_data
#> Created subfolder 10_raw
#> Created subfolder 20_processed
#> [1] "<clipped_path_prefix>/n2khab_data"
  1. From the cloud storage (links: raw data | processed data), download the respective data files of a data source. You can also use the function download_zenodo() to do that, using the DOI of each data source version. For each data source, put its file(s) in an appropriate subdirectory either below n2khab_data/10_raw or n2khab_data/20_processed (depending on the data source). Use the data source’s default name for the subdirectory. You get a list of the data source names with XXX. These names are version-agnostic! The names of the n2khab ‘read’ function and their documentation make clear which data sources you will need.

    Below is an example of correctly organised N2KHAB data directories:

n2khab_data
    ├── 10_raw
    │     ├── habitatmap            -> contains habitatmat.shp, habitatmap.dbf etc.
    │     ├── soilmap
    │     └── GRTSmaster_habitats
    └── 20_processed
          ├── habitatmap_stdized
          └── GRTSmh_diffres

  1. N2KHAB data sources are a list of public, standard data sources, important to analytical workflows concerning Natura 2000 (n2k) habitats (hab) in Flanders. They are in a public repository in order to be easily findable and to be preserved in a durable way.↩︎

  2. This also means that several previously published (open) data sources have been publicly redistributed at Zenodo.↩︎