This package offers a number of commonly used single imputation methods, each with a similar and hopefully simple interface. At the moment the following imputation methodology is supported.
The latest release of the package can be installed as follows.
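Assuming the package is published on CRAN under the name simputation, installation would typically look like the sketch below. The dependencies argument is optional and is included here only because several imputation backends live in other packages.

```r
# Install simputation from CRAN, including the packages it can use as backends
install.packages("simputation", dependencies = TRUE)

# Load the package in the current session
library(simputation)
```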
This package is a wrapper package. It stands on the shoulders of some
great packages that other authors have provided. Below is an overview of
the packages that make imputation with simputation
possible.
| function | model | package | R recommended |
|---|---|---|---|
| impute_rlm | M-estimation | MASS | yes | 
| impute_en | ridge/elasticnet/lasso | glmnet | no | 
| impute_cart | CART | rpart | yes | 
| impute_rf | random forest | randomForest | no | 
| impute_rhd | random hot deck | VIM (optional) | no | 
| impute_shd | sequential hot deck | VIM (optional) | no | 
| impute_knn | k nearest neighbours | VIM (optional) | no | 
| impute_mf | missForest | missForest | no | 
| impute_em | mv-normal | norm | no | 
A call to an imputation function has the following structure: impute_<model>(data, formula, [model-specific options]).
The output is similar to the data argument, except that
empty values are imputed (where possible) using the specified model.
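As a small illustration of this calling convention, the sketch below imputes a toy data frame with impute_lm. The data frame and its column names x and y are made up for this example.

```r
library(simputation)

# Toy data (hypothetical): y has one missing value, x is fully observed
dat <- data.frame(x = c(1, 2, 3, 4), y = c(2.1, NA, 6.2, 7.9))

# Data first, then the formula; any further arguments would be model-specific
out <- impute_lm(dat, y ~ x)

# Same shape as the input, with the gap in y filled by the fitted model
print(out)
```

Because the data comes first and a data frame of the same shape comes back, calls like this can be chained.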
The formula argument specifies the variables to be
imputed, the model specification for <model> and
possibly the grouping of the dataset. A formula object has the structure
IMPUTED ~ MODEL_SPECIFICATION [ | GROUPING ],
where the part between [] is optional.
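To make the layout concrete, here are two formulas written against the iris column names, one without and one with a grouping part. This is only a sketch of how such formulas can be built and inspected in base R; the choice of predictors is arbitrary.

```r
# Imputed variable on the left, predictors on the right, no grouping
f_plain <- Sepal.Length ~ Sepal.Width + Petal.Length

# The same model with an optional grouping part after '|'
f_grouped <- Sepal.Length ~ Sepal.Width + Petal.Length | Species

# Both are ordinary R formula objects and can be inspected before use
class(f_grouped)
all.vars(f_grouped)
```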
In the following, we assume that the reader already has some familiarity with the use of formulas in R (e.g. when specifying linear models) and statistical models commonly used in imputation.
First create a copy of the iris dataset with some empty values in
columns 1 (Sepal.Length), 2 (Sepal.Width) and
5 (Species).
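One way to create such a copy in base R is sketched below. The row ranges are read off the printed output that follows.

```r
# Copy iris and blank out some cells:
# rows 1-3 of column 1, rows 3-7 of column 2, rows 8-10 of column 5
dat <- iris
dat[1:3, 1] <- dat[3:7, 2] <- dat[8:10, 5] <- NA

head(dat, 10)
```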
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1            NA         3.5          1.4         0.2  setosa
## 2            NA         3.0          1.4         0.2  setosa
## 3            NA          NA          1.3         0.2  setosa
## 4           4.6          NA          1.5         0.2  setosa
## 5           5.0          NA          1.4         0.2  setosa
## 6           5.4          NA          1.7         0.4  setosa
## 7           4.6          NA          1.4         0.3  setosa
## 8           5.0         3.4          1.5         0.2    <NA>
## 9           4.4         2.9          1.4         0.2    <NA>
## 10          4.9         3.1          1.5         0.1    <NA>

To impute Sepal.Length using a linear model, use the
impute_lm function.
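A call along the following lines would do this. The predictors Sepal.Width and Species are an assumption made for this sketch, since the text does not name them.

```r
library(simputation)

# Running example: iris with missing cells in columns 1, 2 and 5
dat <- iris
dat[1:3, 1] <- dat[3:7, 2] <- dat[8:10, 5] <- NA

# Impute Sepal.Length from Sepal.Width and Species with a linear model
da1 <- impute_lm(dat, Sepal.Length ~ Sepal.Width + Species)
head(da1, 3)
```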
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1     5.076579         3.5          1.4         0.2  setosa
## 2     4.675654         3.0          1.4         0.2  setosa
## 3           NA          NA          1.3         0.2  setosa

Observe that the 3rd value is not imputed. This is because one of the
predictor variables is missing so the linear model does not produce an
output. simputation does not report such cases but simply
returns the partly imputed result. The remaining value can be imputed
using a new linear model or as shown below, using the group median.
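The group-median step might be written as below. It is a sketch that repeats the assumed linear-model step first, so that it runs on its own.

```r
library(simputation)

dat <- iris
dat[1:3, 1] <- dat[3:7, 2] <- dat[8:10, 5] <- NA

# Step 1: linear model; row 3 stays empty because Sepal.Width is missing there
da1 <- impute_lm(dat, Sepal.Length ~ Sepal.Width + Species)

# Step 2: fill what is left with the median of Sepal.Length within each Species
da2 <- impute_median(da1, Sepal.Length ~ Species)
head(da2, 3)
```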
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1     5.076579         3.5          1.4         0.2  setosa
## 2     4.675654         3.0          1.4         0.2  setosa
## 3     5.000000          NA          1.3         0.2  setosa

Here, Species is used to group the data before computing
the medians.
Finally, we impute the Species variable using a decision
tree model. All variables except Species are used as
predictors.
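A sketch of this step with impute_cart, which relies on the rpart package, could look as follows. The two earlier steps are repeated under the same assumptions so that the block is self-contained.

```r
library(simputation)

dat <- iris
dat[1:3, 1] <- dat[3:7, 2] <- dat[8:10, 5] <- NA

# Earlier steps: linear model, then group median for Sepal.Length
da1 <- impute_lm(dat, Sepal.Length ~ Sepal.Width + Species)
da2 <- impute_median(da1, Sepal.Length ~ Species)

# Species on the left is imputed; '.' on the right means all other columns
da3 <- impute_cart(da2, Species ~ .)
head(da3, 10)
```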
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1      5.076579         3.5          1.4         0.2  setosa
## 2      4.675654         3.0          1.4         0.2  setosa
## 3      5.000000          NA          1.3         0.2  setosa
## 4      4.600000          NA          1.5         0.2  setosa
## 5      5.000000          NA          1.4         0.2  setosa
## 6      5.400000          NA          1.7         0.4  setosa
## 7      4.600000          NA          1.4         0.3  setosa
## 8      5.000000         3.4          1.5         0.2  setosa
## 9      4.400000         2.9          1.4         0.2  setosa
## 10     4.900000         3.1          1.5         0.1  setosa

Using the |> operator (R 4.1.0 or later) allows for a
very compact specification of the above examples.
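Because every imputation function takes the data as its first argument, the three steps can be chained. The sketch below uses the same assumed formulas as before.

```r
library(simputation)

dat <- iris
dat[1:3, 1] <- dat[3:7, 2] <- dat[8:10, 5] <- NA

# Each step receives the partly imputed result of the previous one
da4 <- dat |>
  impute_lm(Sepal.Length ~ Sepal.Width + Species) |>
  impute_median(Sepal.Length ~ Species) |>
  impute_cart(Species ~ .)

head(da4, 10)
```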
The simputation package allows users to specify an imputation model
for multiple variables at once. For example, to impute both
Sepal.Length and Sepal.Width with a similar
robust linear model, do the following.
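Such a specification lists both target variables on the left-hand side, separated by +. In this sketch the predictors Petal.Length and Species are assumptions, and impute_rlm relies on the MASS package.

```r
library(simputation)

dat <- iris
dat[1:3, 1] <- dat[3:7, 2] <- dat[8:10, 5] <- NA

# Two targets on the left; each is modelled separately against the same predictors
da5 <- impute_rlm(dat, Sepal.Length + Sepal.Width ~ Petal.Length + Species)
head(da5)
```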
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1     4.945416    3.500000          1.4         0.2  setosa
## 2     4.945416    3.000000          1.4         0.2  setosa
## 3     4.854056    3.378979          1.3         0.2  setosa
## 4     4.600000    3.440107          1.5         0.2  setosa
## 5     5.000000    3.409543          1.4         0.2  setosa
## 6     5.400000    3.501236          1.7         0.4  setosa

The function will model Sepal.Length and
Sepal.Width against the predictor variables independently
and impute them. The order of variables in the specification is
therefore not important for the result.
In general, the left-hand side of the model formula is analyzed by
simputation, combined appropriately with the right-hand
side and then passed through to the underlying modeling routine.
Simputation also understands the "." syntax, which stands
for "every variable not otherwise present", and the "-" sign to remove
variables from a formula. For example, the next expression imputes every
variable except Species with the group mean plus a normally
distributed random residual.
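One way to express this is sketched below. It assumes that a right-hand side of 0 + Species yields one mean per group and that the add_residual argument of impute_lm accepts "normal". The seed is set only to make the random residuals reproducible.

```r
library(simputation)

dat <- iris
dat[1:3, 1] <- dat[3:7, 2] <- dat[8:10, 5] <- NA

set.seed(1)  # imputed values contain random residuals

# '. - Species' on the left: every variable except Species is a target
# '0 + Species' on the right: no intercept, one coefficient (mean) per species
da6 <- impute_lm(dat, . - Species ~ 0 + Species, add_residual = "normal")
head(da6)
```

Since the residuals are drawn at random, imputed values differ between runs unless the seed is fixed.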
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1     4.951849    3.500000          1.4         0.2  setosa
## 2     5.568217    3.000000          1.4         0.2  setosa
## 3     5.487856    3.321004          1.3         0.2  setosa
## 4     4.600000    3.449319          1.5         0.2  setosa
## 5     5.000000    2.644668          1.4         0.2  setosa
## 6     5.400000    3.403392          1.7         0.4  setosa

Here, Species on the right-hand side defines the
grouping variable.
Use | in the formula argument to specify
groups.
# New data set, leaving Species intact
dat <- iris
dat[1:3,1] <- dat[3:7,2] <- NA
# split dat into groups according to 'Species', impute, combine and return.
da8 <- impute_lm(dat, Sepal.Length ~ Petal.Width | Species)
head(da8)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1     4.968092         3.5          1.4         0.2  setosa
## 2     4.968092         3.0          1.4         0.2  setosa
## 3     4.968092          NA          1.3         0.2  setosa
## 4     4.600000          NA          1.5         0.2  setosa
## 5     5.000000          NA          1.4         0.2  setosa
## 6     5.400000          NA          1.7         0.4  setosa

If one or more grouping variables are specified (multiple are
specified by separating them with +), the data is split into groups,
each group is imputed separately, and the results are combined and returned.
Simputation also integrates with the dplyr package and
recognizes grouping specified with group_by.
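With dplyr loaded, the grouped imputation shown earlier could be written as in the sketch below. It assumes dplyr is installed and uses the native pipe.

```r
library(simputation)
library(dplyr)

# Leave Species intact so it can serve as the grouping variable
dat <- iris
dat[1:3, 1] <- dat[3:7, 2] <- NA

# The grouping comes from group_by() instead of '|' in the formula
da9 <- dat |>
  group_by(Species) |>
  impute_lm(Sepal.Length ~ Petal.Width)

head(da9)
```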
The impute_proxy function is somewhat special since it
allows you to define an imputation method in the right-hand-side of the
formula object. Below we implement a 'robust ratio imputation' (for what
it's worth) as an example.
dat <- iris
dat[1:3,1] <- dat[3:7,2] <- NA
dat <- impute_proxy(dat, Sepal.Length ~ median(Sepal.Length,na.rm=TRUE)/median(Sepal.Width, na.rm=TRUE) * Sepal.Width | Species)
head(dat)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1     5.147059         3.5          1.4         0.2  setosa
## 2     4.411765         3.0          1.4         0.2  setosa
## 3           NA          NA          1.3         0.2  setosa
## 4     4.600000          NA          1.5         0.2  setosa
## 5     5.000000          NA          1.4         0.2  setosa
## 6     5.400000          NA          1.7         0.4  setosa

Imputation with a previously trained model can be done with the impute function. To use it,
train your model in the way you are used to.
Next, use this model to impute a dataset.
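The two steps might look like the sketch below. It assumes that impute accepts the fitted model object on the right-hand side of the formula, and it trains the model on the complete iris data with predictors chosen for illustration.

```r
library(simputation)

# Train a model the usual way, here on the complete iris data
m <- lm(Sepal.Length ~ Sepal.Width + Species, data = iris)

# A dataset with missing values to be imputed
dat <- iris
dat[1:3, 1] <- dat[3:7, 2] <- NA
head(dat)

# Left-hand side: variable to impute; right-hand side: the fitted model object
imp <- impute(dat, Sepal.Length ~ m)
head(imp)
```

As with impute_lm, rows where a predictor is missing (row 3 here) are left unimputed.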
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1           NA         3.5          1.4         0.2  setosa
## 2           NA         3.0          1.4         0.2  setosa
## 3           NA          NA          1.3         0.2  setosa
## 4          4.6          NA          1.5         0.2  setosa
## 5          5.0          NA          1.4         0.2  setosa
## 6          5.4          NA          1.7         0.4  setosa
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1     5.063856         3.5          1.4         0.2  setosa
## 2     4.662076         3.0          1.4         0.2  setosa
## 3           NA          NA          1.3         0.2  setosa
## 4     4.600000          NA          1.5         0.2  setosa
## 5     5.000000          NA          1.4         0.2  setosa
## 6     5.400000          NA          1.7         0.4  setosa

That’s really all there is to it.
The VIM package offers fast implementations for sequential and random hotdeck procedures (based on the data.table package). It also offers somewhat finer control over certain features such as donor selection. For this reason, the sequential, random, and k-nearest neighbours hotdeck imputation procedures can be told to use VIM as backend.
dat <- data.frame(
  foo = c(1,2,NA,4)
  , bar = c(1,NA,8,NA)
)
# sequential hotdeck imputation, no sorting variables
impute_shd(dat, . ~ 1, pool="complete")
impute_shd(dat, . ~ 1, pool="univariate")
impute_shd(dat, . ~ 1, backend="VIM")

Note that VIM uses last observation carried forward by default, and
the specification of donor pool is on a per-variable basis (this cannot
be changed). See ?impute_shd for the full
specification.