---
title: "Using `n2kanalysis` to analyse monitoring data"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Using `n2kanalysis` to analyse monitoring data}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
library(knitr)
opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

## Concept

We developed `n2kanalysis` to analyse data from species monitoring schemes.

1. A first requirement is to make the analysis **reproducible, traceable and auditable**. Should someone challenge the results of the analysis, we must be able to demonstrate which analysis was run with what data.
1. A second requirement is to make the analysis **portable**. Portable in the sense that we can move analyses to other machines to run them. This feature is most relevant when you have many analyses that take a lot of time.
1. A third requirement is to make it **as efficient as possible**. Only run the new analyses. Don't rerun existing analyses. Note that any change in data or metadata results in a new analysis.

The `n2kanalysis` package defines a framework for the analysis. We recommend defining the actual analysis in a dedicated package per monitoring scheme. Two examples are `watervogelanalysis::prepare_analysis()` and `abvanalysis::prepare_analysis()`.

## Workflow

The general workflow within `n2kanalysis` is as follows.

1. Import the relevant data from the source.
1. Generate the analysis objects from the imported data.
1. Fit the analysis objects.
1. Extract the results from the analysis objects.

### Import the relevant data

Often the source of the data is not under the control of the data analyst. Hence the data can change in the source during or after the analysis. Therefore we strongly recommend importing the relevant data from the source and storing it in an "analysis database". Use only this database when analysing the data. This offers a stable starting point for the analysis. In case the source changes, you only need to update the import script. The analysis database should be under some kind of version control.

### Generate the analysis objects

An `n2kModel` is a generic S4 class which contains all required information to fit the model. The machine on which you create the object needs access to the analysis database. The machine that fits the model in the object only needs access to the object itself. Splitting multiple analyses over different machines is trivial.

### Fit the models

The basic option is to fit the `n2kModel` object in memory. A better option is to fit it from the location where you stored the object. `fit_model()` will update the stored object, making it available for archiving.

In case you need to fit several interdependent objects, you can specify the objects in an `n2kManifest` object. Such an object lists all `n2kModel` objects to fit and the order in which you want to fit them. The order matters when an `n2kModel` object depends on the output of a different `n2kModel` object.

### Extract the results

The fitted `n2kModel` contains the model. Hence the user can extract the required information. `n2kanalysis` offers a `get_result()` function to return all relevant parameters as an `n2kResult` object. An `n2kResult` object contains both the parameters and the metadata. You can combine multiple `n2kResult` objects with `combine_result()` into a single `n2kResult` object and still distinguish the output from the different analyses. `get_result()` also returns potential anomalies for `n2k_inla()` models.
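The sketch below illustrates this workflow end to end. It assumes `analysis_a` and `analysis_b` are `n2kModel` objects created earlier, for example by a scheme-specific `prepare_analysis()` function, and that the objects live on a local file system. The exact arguments of `fit_model()`, `get_result()` and `combine_result()` are simplified here; check their help pages before reusing this.

```
library(n2kanalysis)

# Placeholders: `analysis_a` and `analysis_b` are n2kModel objects created
# earlier, e.g. by a scheme-specific prepare_analysis() function.
base <- "~/analysis_database"  # a directory (or an S3 bucket)
project <- "my_scheme"         # subdirectory at the root of `base`

# Store the objects; the file name equals the file fingerprint.
store_model(analysis_a, base = base, project = project)
store_model(analysis_b, base = base, project = project)

# Fit the stored objects; fit_model() updates the stored versions.
fit_model(analysis_a, base = base, project = project)
fit_model(analysis_b, base = base, project = project)

# Extract the results and combine them into a single n2kResult object.
result <- combine_result(
  get_result(analysis_a, base = base, project = project),
  get_result(analysis_b, base = base, project = project)
)
```

The following sections explain these functions and the underlying objects in more detail.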
## Traceable and auditable

Every `n2kModel` object has a so-called file fingerprint and status fingerprint. The file fingerprint is a [SHA-1 hash](https://en.wikipedia.org/wiki/SHA-1) based on everything which will remain stable during the lifetime of the object. Think of the data, model type, model formula, species group, ... Creating the object generates the hash based on the information at the time of creation. We check the validity of the file fingerprint before fitting the model or retrieving results. You can use `validObject()` to do the validation manually. The probability that two `n2kModel` objects with different stable information get the same file fingerprint is very small. Therefore we can reference the `n2kModel` object by its file fingerprint and use it as the filename. You can retrieve this hash with `get_file_fingerprint()`.

The status fingerprint is based on the file fingerprint and everything that can change during the lifetime of the object. The most obvious element is the model. You can retrieve this hash with `get_status_fingerprint()`. Reporting both the file fingerprint and the status fingerprint alongside the results uniquely identifies a specific version of the `n2kModel` object.

## Portable

The `n2kModel` object contains all required information to fit the model. The only exception is the case where the object needs information from one or more parent objects. Then you must provide access to those objects as well.

Use `store_model()` to store the objects. The `base` argument is either the path of a directory on a file system or an [S3 bucket](https://aws.amazon.com/s3/). The `project` argument defines a subdirectory at the root of `base`. Moving the analyses to a different file system is as simple as copying everything in the `project` folder to the other file system. For S3 buckets you only need to give the machine access to the S3 bucket, no need to copy the files manually. `fit_model()` will read the object from the S3 bucket, fit the model and update the object in the S3 bucket.

## Efficient

Recall that creating the object generates a file fingerprint which only depends on the stable elements. Use `store_model(overwrite = FALSE)` when you generate objects. When nothing changed, the file fingerprint doesn't change and we keep the existing object, which we might have fitted earlier. Suppose you need to change the definition of some child objects. That will alter their file fingerprints and thus generate new objects. However, the parent objects remain as is.

Every object has a `status`. By default `fit_model()` only handles objects with status "new" or "waiting".

1. "new": the object is ready for fitting. When required, the information from the parent objects is available in the child object. `fit_model()` fits the object.
1. "waiting": the object is missing information from at least one parent object. `fit_model()` updates the information from the parent objects and updates the status accordingly.
1. "converged": `fit_model()` successfully fitted the object.
1. "error": something went wrong when fitting the model, or one of the parents has the status "error".

Refitting objects with status "error" or "converged" only makes sense after upgrading `n2kanalysis`.

## Metadata

Because we need to transfer the `n2kModel` object between machines, it is important to add the necessary metadata to the object. The metadata contains both the file and status fingerprints.
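Both fingerprints have a getter function. A minimal sketch, assuming `analysis` is an existing `n2kModel` object (a placeholder name):

```
# Report these two hashes alongside the results to identify the exact
# version of the analysis.
get_file_fingerprint(analysis)    # based on the stable part: data, model definition, ...
get_status_fingerprint(analysis)  # changes when e.g. the fitted model changes

# Manually validate the object, including its file fingerprint.
validObject(analysis)
```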
### Definition of the model

The model type is a more generic description of the model, whereas the formula is the actual formula used for this specific analysis. The more generic model type allows grouping results among species or locations, while the formula still allows using a different formula for some objects. Suppose that the generic model uses both year and month as covariates. When a species is only present during a single month, it doesn't make sense to include month as a covariate for this species. `seed` sets the seed to make the analysis reproducible.

### Scheme, species and location

Other important information in the metadata are the scheme id, the species group id and the location group id. The scheme id refers to the monitoring scheme that delivered the data. The species group id refers to the list of species handled in this analysis. When you analyse a single species, you need to create a species group containing only this species. Similarly, the location group id refers to a list of locations.

The management of the species and location groups is outside the scope of `n2kanalysis`. We recommend three tables: `speciesgroup`, `species` and `speciesgroup_species`. `speciesgroup` and `species` should contain at least an `id` and a `description` field. `speciesgroup_species` should contain at least a `speciesgroup_id` and a `species_id` field. Hence a species can be part of one or more species groups. Use a similar structure for `locationgroup`, `location` and `locationgroup_location`. `result_datasource_id` refers to the database where you store the definitions of `scheme_id`, `species_group_id` and `location_group_id`.

### Time-stamp

Document the available time period with `first_imported_year` and `last_imported_year`. `duration` and `last_analysed_year` refer to the time span for this analysis. The default `duration` is the difference between the last and first imported year. The default `last_analysed_year` is the `last_imported_year`. Use different values in case the analysis covers only a subset of the full dataset. Suppose the full dataset spans 1992 to 2023. If you want to analyse the period 2011 - 2020, you set `duration = 10` and `last_analysed_year = 2020`.

`analysis_date` refers to the moment when you imported the source data. You can use the time-stamp of the commit in case you placed the data under version control with git. `git2rdata::recent_commit()` is a handy function to find the most recent commit that changed a specific file.

### Used packages

The metadata contains one or more "analysis versions". An "analysis version" is the list of packages, including their version numbers, loaded when creating or fitting the model. Fitting the object on a machine with different R packages than the one on which you generated the object will result in a second "analysis version". The goal is to document the software used during the process.

Store the scripts to generate the objects in a dedicated package and load that package when running the script. Such a workflow documents the code to create the objects. Have a look at [`abvanalysis`](https://github.com/inbo/abvanalysis) and [`watervogelanalysis`](https://github.com/inbo/watervogelanalysis) for some real examples. Both packages contain the code to:

1. Import the data from the source.
1. Create the analysis objects.
1. Get the results from the analysis objects.
1. Display the results in a report or website.

### Linked analyses

Optionally you can define one or more parent analyses for the object. Use this when you need to combine the analyses of different species into a multi-species index, or when you want to analyse trends with multiple imputation. In the latter case, the imputation model is the parent object, the aggregated imputed data is the child object and the analysis of the aggregated imputed data is the grandchild object.
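Putting the metadata fields from this section together, creating an analysis object could look roughly like the sketch below. The data set and the id values are placeholders, and the argument names simply mirror the field names used above; consult the help page of the constructor you use (e.g. `?n2k_inla`) for the authoritative signature.

```
# `bird_counts` is a hypothetical data frame taken from the analysis database.
analysis <- n2k_inla(
  data = bird_counts,
  model_type = "inla poisson: year + month",  # generic description of the model
  formula = "count ~ year + month",           # formula for this specific analysis
  seed = 20240101,
  result_datasource_id = 1,    # database holding the group definitions
  scheme_id = 1,               # monitoring scheme that delivered the data
  species_group_id = 7,        # species group containing a single species
  location_group_id = 3,       # e.g. all locations in the country
  first_imported_year = 1992,
  last_imported_year = 2023,
  duration = 10,               # analyse a ten-year window ...
  last_analysed_year = 2020,   # ... ending in 2020
  analysis_date = Sys.time()   # or the commit time-stamp via git2rdata::recent_commit()
)
```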
## Available models

### `n2k_import()`

A dummy model to document the import step. It allows you to use the import as a parent object for the subsequent models.

### `n2k_inla()`

Fits models with integrated nested Laplace approximation (INLA). Setting `imputation_size` larger than 0 (the default) performs multiple imputation on the missing values using `multimput::impute()`.

In case of multiple imputation you can define a `minimum` covariate. This covariate holds the minimal value to impute. Use this in case of incomplete samples, for example when you count the number of birds at a location and you sampled only a part of the location. The observed count is thus a lower bound of the true value. Set the response variable to missing and the minimum variable to the observed count.

Another option is to specify an `extra` dataset of observations. We intended this option for rare observations that might distort the imputation model, for instance due to few observation events at a location, only a few non-zero observations at a location or a few extremely high observations at a location. Exclude the observations from such locations from the imputation model and add them to `extra`. You won't get imputations for the missing observations in `extra`, but we use the observed numbers in `extra` in the subsequent aggregation.

### `n2k_inla_comparison()`

Compares multiple `n2k_inla()` objects based on the Watanabe-Akaike Information Criterion (WAIC).

### `n2k_composite()`

Combines several `n2kModel` objects into a single analysis, for example to combine the results of individual species into a composite species index. The `extractor` function should extract the mean and variance of the parameters. The mean in the fitted object is the mean of the means of every parameter over the models. The standard error is the square root of the average variance of every parameter over the models. Hence we assume independence of the parameters between the models.

### `n2k_hurdle_imputed()`

Combines two `n2k_inla()` objects with multiple imputation. The `presence` model is an `n2k_inla()` binomial model that describes the presence of the species. The `count` model is an `n2k_inla()` zero-truncated Poisson or zero-truncated negative binomial model that describes the non-zero counts of the species.

INLA provides type 0 and type 1 zero-inflated models. The type 0 model defines the likelihood as a point mass at zero plus a zero-truncated distribution.

$$Prob(y) = p \times 1_{[y = 0]} + (1 - p) \times \mathcal{P}(y \mid y > 0)$$

The type 1 model defines the likelihood as a combination of a point mass at zero and a distribution which can produce zero values too.

$$Prob(y) = p \times 1_{[y = 0]} + (1 - p) \times \mathcal{P}(y)$$

The trick to get a zero-truncated distribution is to use a type 0 zero-inflated distribution and fix the probability of the point mass at zero to a very small value. Below is the required INLA setting. Note that the value is expressed on the log scale: `-11` on the log scale is equivalent to $p = 0.0000167$.

```
control.family = list(
  list(hyper = list(theta2 = list(initial = -11, fixed = TRUE)))
)
```
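The quoted probability is simply the back-transformed hyperparameter value:

```
exp(-11)
#> [1] 1.67017e-05
```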
### `n2k_aggregate()`

The aggregation step after multiple imputation. `fit_model()` will call `multimput::aggregate_impute()`. Use the `join` argument to select a subset of locations, for example when you ran an imputation model at the country level and you want to aggregate over a region within the country. Note that you can use the same imputation model for different aggregations, for example one imputation model at the country level and aggregations for multiple regions.

### `n2k_modelimputed()`

Fits a model on an (aggregated) imputed dataset. `fit_model()` will call `multimput::model_impute()`. You can use `filter` to subset the data after the imputation. This is a useful feature when modelling an aggregation on a smaller subset, which might result in leading zeros in the dataset. You can provide a custom filter function to remove those observations from the dataset. The function handles empty datasets after applying the filter.

## Manifest

The most basic way to fit the models is to loop over all files. This is suboptimal when some models depend on other models. Since the file name equals the file fingerprint, the order of the file names has no link with the optimal order to fit the models. We solved this problem by introducing the `n2k_manifest()` object. The manifest is simply a data frame with the file fingerprint of the model and the file fingerprint of the parent model. This information is available after generating the models. Hence you can generate all `n2kModel` objects and store the links between the objects in the manifest. When a model needs multiple parent objects, add one row per parent.

Applying `fit_model()` on an `n2k_manifest()` fits the models in an optimal order. The initial list of models consists of the models without parents. Then it handles the children of the initial models, then the grandchildren, and so on. This order guarantees that all parent models are fitted before fitting a model.

Another relevant use of a manifest is that it bundles a set of models, for example all models for a given monitoring scheme in a specific year. The manifest has its own file fingerprint, useful to reference a specific manifest.

## Docker

`n2kanalysis` helps to maximise reproducibility by making it easier to run the analysis in a Docker container. Use `store_manifest_yaml()` to store the manifest and specify the Docker image plus optional extra dependencies to install. `manifest_yaml_to_bash()` creates a bash script that fits the models one by one in the specified Docker container.

Using Docker improves the portability of the analysis as well. When you stored the objects in an S3 bucket, you only need to copy the script to the other machine and run it. Or copy both the script and the objects in case you store everything on a file system.

`manifest_yaml_to_bash()` offers the option to shut down the machine at the end of the script. This avoids the cost of running an idle server. You can also have it split the analysis over multiple machines. No fancy parallel computing is involved. Suppose you specified `split = 2`. Then you simply get two scripts, one with the "odd" analyses and one with the "even" analyses.
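A rough sketch of such a setup is shown below. The argument names for the Docker image, the extra dependencies, the shutdown option and the split are assumptions based on the description above; check `?store_manifest_yaml` and `?manifest_yaml_to_bash` for the exact interface.

```
# Placeholder values throughout; argument names are assumptions.
store_manifest_yaml(
  manifest,                           # an n2kManifest created earlier
  base = "~/analysis_database",
  project = "my_scheme",
  docker = "rocker/r-ver:4.4.1",      # image in which to fit the models
  dependencies = "inbo/n2kanalysis"   # extra packages to install in the image
)
manifest_yaml_to_bash(
  base = "~/analysis_database",
  project = "my_scheme",
  shutdown = TRUE,  # power the machine off after the last model
  split = 2         # two scripts: one with the "odd", one with the "even" analyses
)
```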