--- title: "Where to store the N2KHAB input data sources" author: "Floris Vanderhaeghe" date: "2019-08-02" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{020. Where to store the input data sources} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} options(stringsAsFactors = FALSE) knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ## Introduction For a practical recipe to setup `n2khab_data`, go to [Getting started](#getting-started). See `vignette("v022_example")` if you'd like to be guided by a hands-on example. ### Distribution of the data sources used by `n2khab` functions Apart from several textual datasets, provided directly with this package, other N2KHAB data sources [^N2KHAB] are binary or large data. Those are made available through cloud-based infrastructure, preserved for the future at least via Zenodo (see below).^[ This also means that several previously published ([open](https://opendefinition.org/licenses/)) data sources have been publicly redistributed at Zenodo. ] An overview of data distribution pathways is given [here](https://drive.google.com/file/d/1xZz9f9n8zSUxBJvW6WEFLyDK7Ya0u4iN/view). [^N2KHAB]: N2KHAB data sources are a list of public, standard data sources, important to analytical workflows concerning Natura 2000 (n2k) habitats (hab) in Flanders. They are in a public repository in order to be easily findable and to be preserved in a durable way. ### More about Zenodo **[Zenodo](https://zenodo.org)** is a scientific repository funded by the European Commission and hosted at CERN. - We prefer Zenodo for its straightforward, easy approach of preserving data sources for the **long term** -- needed for reproducibility -- while providing a stable **DOI** link for each version and for each record as whole. - Managing the N2KHAB data sources at Zenodo allowed us to apply _a uniform and pure representation_ of each data source: we made _one_ data version correspond to _one_ data set following _one_ fileformat (not zipped); we add no other files (e.g. metadata files like pdf's, alternative fileformats etc.). The filenames in these Zenodo records follow the codes of the data sources that are used in the `n2khab` package. - Zenodo storage nicely fits with an internationalized approach of reproducible N2KHAB workflows in R, as its website is in English. ### Local data storage Data sources evolve, and hence, data source versions succeed one another. To ease reproducibility of analytical workflows, this package assumes _locally stored_ data sources. The `n2khab` functions, aimed at reading these data and returning them in R in some kind of standardized way, always provide _arguments_ to specify the file's name and location -- so you can in fact freely choose these. However, to **ease collaboration in scripting**, it is highly recommended to follow the below standard locations and filenames (see: [Getting started](#getting-started)). Moreover, the _functions assume_ these conventions by default in order to make your life easier! There is a major distinction between: - **raw data** ([Zenodo-link](https://zenodo.org/communities/n2khab-data-raw)), to be stored in a directory `n2khab_data/10_raw`; - **processed data** ([Zenodo-link](https://zenodo.org/communities/n2khab-data-processed)), to be stored in a directory `n2khab_data/20_processed`. These data sources have been derived from the raw data sources, but are distributed on their own because of the time-consuming or intricate calculations needed to reproduce them. You can reproduce the processed data sources from a [shell script on Github](https://github.com/inbo/n2khab-preprocessing/blob/master/src/complete_reproducible_workflow.sh), but it will take hours. These binary or large data sources are to be stored in a dedicated directory `n2khab_data` on your system. Don't use this special directory for adding other data. It can reside inside one project or repository but it can also deliver to several projects / repositories; see further. `n2khab_data` should always be ignored by version control systems. ## Getting started for your (collaborative) workflow {#getting-started} Mind that, _if_ you store the `n2khab_data` directory inside a version controlled repository (e.g. using git), it must be **ignored by version control**! 1. Decide **where** you want to store the `n2khab_data` directory: - from the viewpoint of several projects / several git repositories, when these need the same data source versions, the location may be at a high level in your file system. A convenient approach is to use the directory which holds the different project directories / repositories. - from the viewpoint of one project / repository: the `n2khab_data` directory can be put inside the project / repository directory. This approach has the advantage that you can store versions of data sources different from those in another repository (where you also have an `n2khab_data` directory). For the functions to succeed in finding the `n2khab_data` directory in each collaborator's file system, make sure that the directory is present _either in the working directory of your R scripts or in a path at some level above this working directory_. By default, the functions search the directory in that order and use the **first encountered** `n2khab_data` directory. Alternatively, you can set an environment variable `N2KHAB_DATA_PATH` or option `n2khab_data_path` to enforce a specific directory on your system that all `n2khab` functions will use (do that outside the files you collaborate on and share; see `n2khab_options()`). 1. From your working directory, use `fileman_folders()` to specify the desired location (using the function's arguments). It will check the existence of the directories `n2khab_data`, `n2khab_data/10_raw` and `n2khab_data/20_processed` and create them if they don't exist. ```{r eval=FALSE} fileman_folders(root = "rproj") #> Created /n2khab_data #> Created subfolder 10_raw #> Created subfolder 20_processed #> [1] "/n2khab_data" ``` 3. From the cloud storage (links: [raw data](https://zenodo.org/communities/n2khab-data-raw) | [processed data](https://zenodo.org/communities/n2khab-data-processed)), **download** the respective data files of a data source. You can also use the function `download_zenodo()` to do that, using the DOI of each data source version. For each data source, put its file(s) in an appropriate subdirectory either below `n2khab_data/10_raw` or `n2khab_data/20_processed` (depending on the data source). Use the data source's default name for the subdirectory. You get a list of the data source names with _XXX_. These names are version-agnostic! The names of the `n2khab` 'read' function and their documentation make clear which data sources you will need. Below is an example of correctly organised N2KHAB data directories: ``` n2khab_data ├── 10_raw │ ├── habitatmap -> contains habitatmat.shp, habitatmap.dbf etc. │ ├── soilmap │ └── GRTSmaster_habitats └── 20_processed ├── habitatmap_stdized └── GRTSmh_diffres ```