Title: | Using Multiple Imputation to Address Missing Data |
---|---|
Description: | Accompanying package for the paper: Working with population totals in the presence of missing data comparing imputation methods in terms of bias and precision. Published in 2017 in the Journal of Ornithology volume 158 page 603–615 (<doi:10.1007/s10336-016-1404-9>). |
Authors: | Thierry Onkelinx [aut, cre] (<https://orcid.org/0000-0001-8804-4216>, Research Institute for Nature and Forest (INBO)), Koen Devos [aut] (<https://orcid.org/0000-0001-7265-6349>, Research Institute for Nature and Forest (INBO)), Paul Quataert [aut] , Research Institute for Nature and Forest (INBO) [cph, fnd] |
Maintainer: | Thierry Onkelinx <[email protected]> |
License: | GPL-3 |
Version: | 0.2.14 |
Built: | 2024-12-03 06:14:33 UTC |
Source: | https://github.com/inbo/multimput |
Aggregate an imputed dataset
aggregate_impute(object, grouping, fun, filter = list(), join)

## S4 method for signature 'ANY'
aggregate_impute(object, grouping, fun, filter = list(), join)

## S4 method for signature 'rawImputed'
aggregate_impute(object, grouping, fun, filter = list(), join)

## S4 method for signature 'aggregatedImputed'
aggregate_impute(object, grouping, fun, filter = list(), join)
object |
A |
grouping |
A vector of variable names to group the aggregation on. |
fun |
The function to aggregate. |
filter |
An optional argument to filter the raw dataset before aggregation.
Will be passed to |
join |
An optional argument to filter the raw dataset based on a data.frame.
A |
dataset <- generate_data(n_year = 10, n_site = 50, n_run = 1)
dataset$Count[sample(nrow(dataset), 50)] <- NA
model <- lm(Count ~ Year + factor(Period) + factor(Site), data = dataset)
imputed <- impute(data = dataset, model = model)
aggregate_impute(imputed, grouping = c("Year", "Period"), fun = sum)
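To illustrate what aggregating an imputed dataset means, the following base R sketch sums each imputation set within groups. It does not use the package: the objects `covariates`, `imputations` and `group` are hypothetical toy data, and the real `aggregate_impute()` may work differently internally.

```r
# Hypothetical toy data: 6 observations (rows) and 3 imputation sets (columns).
covariates <- data.frame(
  Year = c(1, 1, 1, 2, 2, 2),
  Period = c(1, 1, 2, 1, 2, 2)
)
imputations <- matrix(
  c(1, 2, 3, 4, 5, 6,
    2, 2, 3, 4, 5, 7,
    1, 3, 3, 4, 6, 6),
  ncol = 3
)

# One grouping key per observation, e.g. "1.1" for Year 1 and Period 1.
group <- interaction(covariates$Year, covariates$Period, drop = TRUE)

# Sum every imputation column within each Year-Period combination.
aggregated <- rowsum(imputations, group = group)
aggregated
```

Each row of `aggregated` holds one group and each column one imputation set, so the spread across columns reflects the uncertainty due to the missing data.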
The aggregatedImputed class
Holds an aggregated imputation data set.
Covariate
A data.frame with the covariates.
Imputation
A matrix with aggregated imputed values.
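As a rough illustration of the two slots listed above, here is a minimal S4 sketch. The class name `toyAggregatedImputed` is hypothetical, and the validity rule (one row of imputed values per row of covariates) is an assumption, not something taken from the package source.

```r
library(methods)

# Hypothetical class name, chosen so it cannot clash with the package's class.
setClass(
  "toyAggregatedImputed",
  representation(Covariate = "data.frame", Imputation = "matrix"),
  validity = function(object) {
    # Assumption: one row of imputed values per row of covariates.
    if (nrow(object@Covariate) != nrow(object@Imputation)) {
      return("Covariate and Imputation must have the same number of rows")
    }
    TRUE
  }
)

# Three aggregated groups with 19 imputed values each.
toy <- new(
  "toyAggregatedImputed",
  Covariate = data.frame(Year = 1:3),
  Imputation = matrix(rpois(3 * 19, lambda = 10), nrow = 3)
)
dim(toy@Imputation)
```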
Generate data for a regular monitoring design. The counts follow a negative binomial distribution with given size parameters and the true mean mu depending on a year, period and site effect. All effects are independent from each other and have, on the log-scale, a normal distribution with zero mean and given standard deviation.
generate_data(
  intercept = 2,
  n_year = 24,
  n_period = 6,
  n_site = 20,
  year_factor = FALSE,
  period_factor = FALSE,
  site_factor = FALSE,
  trend = 0.01,
  sd_rw_year = 0.1,
  amplitude_period = 1,
  mean_phase_period = 0,
  sd_phase_period = 0.2,
  sd_site = 1,
  sd_rw_site = 0.02,
  sd_noise = 0.01,
  size = 2,
  n_run = 10,
  as_list = FALSE,
  details = FALSE
)
intercept |
The global mean on the log-scale. |
n_year |
The number of years. |
n_period |
The number of periods. |
n_site |
The number of sites. |
year_factor |
Convert year to a factor.
Defaults to |
period_factor |
Convert period to a factor.
Defaults to |
site_factor |
Convert site to a factor.
Defaults to |
trend |
The long-term linear trend on the log-scale. |
sd_rw_year |
The standard deviation of the year effects on the log-scale. |
amplitude_period |
The amplitude of the periodic effect on the log-scale. |
mean_phase_period |
The mean of the phase of the periodic effect among
years.
Defaults to |
sd_phase_period |
The standard deviation of the phase of the periodic effect among years. |
sd_site |
The standard deviation of the site effects on the log-scale. |
sd_rw_site |
The standard deviation of the random walk along year per site on the log-scale. |
sd_noise |
The standard deviation of the noise effects on the log-scale. |
size |
The size parameter of the negative binomial distribution. |
n_run |
The number of runs with the same mu. |
as_list |
Return the dataset as a list rather than a data.frame.
Defaults to |
details |
Add variables containing the year, period and site effects.
Defaults to |
A data.frame with five variables. Year, Period and Site identify the time and location of monitoring. Mu is the true mean of the negative binomial distribution on the original scale. Count holds the simulated counts.
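The data-generating idea described for this function (independent, zero-mean normal effects on the log scale feeding a negative binomial distribution) can be sketched in base R as follows. This is a simplified illustration, not the code of `generate_data()`: it omits the trend, the random walks and the periodic effect, and names such as `design`, `year_effect` and the chosen standard deviations are assumptions.

```r
set.seed(1)
n_year <- 5
n_period <- 3
n_site <- 4

# Regular monitoring design: every site is visited in every year and period.
design <- expand.grid(
  Year = seq_len(n_year),
  Period = seq_len(n_period),
  Site = seq_len(n_site)
)

# Independent effects, normally distributed on the log scale with zero mean.
year_effect <- rnorm(n_year, mean = 0, sd = 0.1)
period_effect <- rnorm(n_period, mean = 0, sd = 0.5)
site_effect <- rnorm(n_site, mean = 0, sd = 1)

# True mean on the original scale (intercept = 2 on the log scale).
design$Mu <- exp(
  2 + year_effect[design$Year] + period_effect[design$Period] +
    site_effect[design$Site]
)

# Counts follow a negative binomial distribution with size = 2.
design$Count <- rnbinom(nrow(design), size = 2, mu = design$Mu)
head(design)
```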
Multiplies the imputed values for the presence model with those of the count model. Please make sure that the order of the observations in both models is identical. The resulting object will contain the union of the covariates of both models. Variables with the same name and different values get a presence_ or count_ prefix.
hurdle_impute(presence, count)
presence |
the |
count |
the |
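The multiplication described above can be pictured with two plain matrices of imputed values, one row per observation and one column per imputation. This is only a sketch on hypothetical toy matrices; `hurdle_impute()` itself works on imputed objects and also merges the covariates.

```r
set.seed(42)
n_obs <- 5
n_imp <- 3

# Hypothetical imputations: presence as 0/1, counts as positive integers.
presence <- matrix(rbinom(n_obs * n_imp, size = 1, prob = 0.6), nrow = n_obs)
count <- matrix(rpois(n_obs * n_imp, lambda = 4) + 1, nrow = n_obs)

# Both matrices must describe the same observations in the same order.
stopifnot(identical(dim(presence), dim(count)))

# Element-wise product: zero where absent, the imputed count where present.
hurdle <- presence * count
hurdle
```

Because the product is taken element by element, a mismatch in the order of the observations would silently combine values from different observations, which is why the identical ordering matters.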
Impute a dataset
impute(model, ..., extra, n_imp = 19)

## S4 method for signature 'ANY'
impute(model, ..., extra, n_imp = 19)

## S4 method for signature 'glmerMod'
impute(model, data, ..., extra, n_imp)

## S4 method for signature 'maybeInla'
impute(
  model,
  ...,
  seed = 0L,
  num_threads = NULL,
  parallel_configs = TRUE,
  extra,
  n_imp = 19
)

## S4 method for signature 'lm'
impute(model, data, ..., extra, n_imp)
model |
model to impute the dataset |
... |
other arguments. See details |
extra |
a |
n_imp |
the number of imputations.
Defaults to |
data |
The dataset holding both the observed and the missing values |
seed |
See the same argument in |
num_threads |
The number of threads to use in the format |
parallel_configs |
Logical.
If TRUE and not on Windows, then try to run each configuration in parallel
using |
dataset <- generate_data(n_year = 10, n_site = 50, n_run = 1)
dataset$Count[sample(nrow(dataset), 50)] <- NA
model <- lm(Count ~ Year + factor(Period) + factor(Site), data = dataset)
impute(model, dataset)
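For readers who want a feel for what imputing from a linear model involves, here is a base R sketch that draws several plausible values for each missing response. It is an assumption-laden illustration (normal predictive draws on toy data named `toy`), not a description of how `impute()` is implemented.

```r
set.seed(123)

# Toy data with three missing responses.
toy <- data.frame(x = 1:20)
toy$y <- 3 + 2 * toy$x + rnorm(20)
toy$y[c(4, 11, 17)] <- NA

# lm() drops the rows with a missing response when fitting.
fit <- lm(y ~ x, data = toy)
missing <- is.na(toy$y)
pred <- predict(fit, newdata = toy[missing, , drop = FALSE], se.fit = TRUE)

# Assumption: predictive standard deviation combines the uncertainty of the
# fitted mean with the residual standard deviation.
pred_sd <- sqrt(pred$se.fit^2 + pred$residual.scale^2)

# One row per missing observation, one column per imputation.
n_imp <- 19
draws <- matrix(
  rnorm(sum(missing) * n_imp, mean = pred$fit, sd = pred_sd),
  nrow = sum(missing),
  ncol = n_imp
)
dim(draws)
```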
The observed values will be either equal to the counts or missing. The probability of missing is the inverse of the counts + 1.
missing_at_random(
  dataset,
  proportion = 0.25,
  count_variable = "Count",
  observed_variable = "Observed"
)
dataset |
The dataset to which the observed variable with missing data is added. |
proportion |
The proportion of observations that will be missing. |
count_variable |
The name of the variable holding the counts. |
observed_variable |
The name of the variable holding the observed values (either the count or missing). |
The observed values will be either equal to the counts or missing. The probability of missing is the inverse of the counts + 1.
missing_current_count(
  dataset,
  proportion = 0.25,
  count_variable = "Count",
  observed_variable = "Observed"
)
dataset |
The dataset to which the observed variable with missing data is added. |
proportion |
The proportion of observations that will be missing. |
count_variable |
The name of the variable holding the counts. |
observed_variable |
The name of the variable holding the observed values (either the count or missing). |
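The count-dependent missingness described for this function can be sketched in base R by using the inverse of the counts + 1 as sampling weights. The data frame `toy` and the helper variables are hypothetical, and weighted sampling without replacement is only one possible reading of the description.

```r
set.seed(7)

# Hypothetical counts, from very low to very high.
toy <- data.frame(Count = c(0, 0, 1, 2, 5, 10, 20, 50, 100, 200))
proportion <- 0.3
n_missing <- round(proportion * nrow(toy))

# Weight each row by 1 / (Count + 1): low counts are more likely to go missing.
drop_rows <- sample(
  nrow(toy),
  size = n_missing,
  prob = 1 / (toy$Count + 1)
)

# Observed equals Count, except for the rows selected as missing.
toy$Observed <- toy$Count
toy$Observed[drop_rows] <- NA
toy
```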
The observed values will be either equal to the counts or missing. The probability of missing is the inverse of the counts + 1.
missing_observed(
  dataset,
  count_variable = "Count",
  observed_variable = "Observed",
  site_variable = "Site",
  year_variable = "Year",
  period_variable = "Period"
)
dataset |
The dataset to which the observed variable with missing data is added. |
count_variable |
The name of the variable holding the counts. |
observed_variable |
The name of the variable holding the observed values (either the count or missing). |
site_variable |
The name of the variable holding the sites. |
year_variable |
The name of the variable holding the years. |
period_variable |
The name of the variable holding the period. |
The observed values will be either equal to the counts or missing. The probability of missing is the inverse of the counts + 1.
missing_volunteer(
  dataset,
  proportion = 0.25,
  count_variable = "Count",
  observed_variable = "Observed",
  year_variable = "Year",
  site_variable = "Site",
  max_count = 100
)
dataset |
The dataset to which the observed variable with missing data is added. |
proportion |
The proportion of observations that will be missing. |
count_variable |
The name of the variable holding the counts. |
observed_variable |
The name of the variable holding the observed values (either the count or missing). |
year_variable |
The name of the variable holding the years. |
site_variable |
The name of the variable holding the sites. |
max_count |
The maximum count. |
Model an imputed dataset
model_impute(
  object,
  model_fun,
  rhs,
  model_args = list(),
  extractor,
  extractor_args = list(),
  filter = list(),
  mutate = list(),
  ...,
  timeout = 600
)

## S4 method for signature 'ANY'
model_impute(
  object,
  model_fun,
  rhs,
  model_args = list(),
  extractor,
  extractor_args = list(),
  filter = list(),
  mutate = list(),
  ...,
  timeout = 600
)

## S4 method for signature 'aggregatedImputed'
model_impute(
  object,
  model_fun,
  rhs,
  model_args = list(),
  extractor,
  extractor_args = list(),
  filter = list(),
  mutate = list(),
  ...,
  timeout = 600
)
object |
The imputed dataset. |
model_fun |
The function to apply on each imputation set.
Or a string with the name of the function.
Include the package name when the function is not in one of the base R
packages.
For example: |
rhs |
The right hand side of the model. |
model_args |
An optional list of arguments to pass to the model function. |
extractor |
A function which returns a |
extractor_args |
An optional list of arguments to pass to the |
filter |
An optional argument to filter the aggregated dataset.
Either a function which takes the |
mutate |
An optional argument to alter the aggregated dataset.
Will be passed to the |
... |
currently ignored. |
timeout |
Maximum duration allowed for fitting a single imputation
model in seconds.
Defaults to |
dataset <- generate_data(n_year = 10, n_site = 50, n_run = 1)
dataset$Count[sample(nrow(dataset), 50)] <- NA
model <- lm(Count ~ Year + factor(Period) + factor(Site), data = dataset)
imputed <- impute(data = dataset, model = model)
aggr <- aggregate_impute(imputed, grouping = c("Year", "Period"), fun = sum)
extractor <- function(model) {
  summary(model)$coefficients[, c("Estimate", "Std. Error")]
}
model_impute(
  object = aggr,
  model_fun = lm,
  rhs = "0 + factor(Year)",
  extractor = extractor
)
The rawImputed class
Holds a dataset and imputed values.
Data
A data.frame with the data.
Response
A character holding the name of the response variable.
Minimum
An optional character holding the name of the variable with the minimum.
Imputation
A matrix with imputed values.
Extra
A data.frame with extra data to add to the imputations. This data is not used in the imputation model. It must contain the same variables as the original data.
Data for Figures 1 and 2 in Onkelinx et al.
data(waterfowl)
A data frame with 77157 rows and 5 variables
Site
Site ID.
Winter
Winter ID.
Period
ID of the month.
Species
Number of observed species.
Birds
Total number of birds.