Missing data can be a not so trivial problem when analysing a dataset and accounting for it is usually not so straightforward either.
If the amount of missing data is very small relatively to the size of the dataset, then leaving out the few samples with missing features may be the best strategy in order not to bias the analysis. However leaving out available datapoints deprives the data of some amount of information and depending on your situation. You may want to look for other fixes before wiping out potentially useful data points from your dataset.
While some quick fixes such as mean-substitution may be fine in some cases, these approaches usually introduce bias into the data.
The MICE package in R, helps you imputing missing values with plausible data values. These plausible values are drawn from a distribution specifically designed for each missing datapoint.
Impute missing values using a the airquality dataset. To simplify things, some data will be removed.
data <- airquality
data[4:10,3] <- rep(NA,7)
data[1:5,4] <- NA
Replacing categorical variables is usually not advisable. Some common practice include replacing missing categorical variables with the mode of the observed ones. However, it is questionable whether it is a good choice. Even though in this case no data points are missing from the categorical variables, we remove them from our dataset (we can add them back later if needed) and take a look at the data using summary().
data <- data[-c(5,6)]
summary(data)
## Ozone Solar.R Wind Temp
## Min. : 1.00 Min. : 7.0 Min. : 1.700 Min. :57.00
## 1st Qu.: 18.00 1st Qu.:115.8 1st Qu.: 7.400 1st Qu.:73.00
## Median : 31.50 Median :205.0 Median : 9.700 Median :79.00
## Mean : 42.13 Mean :185.9 Mean : 9.806 Mean :78.28
## 3rd Qu.: 63.25 3rd Qu.:258.8 3rd Qu.:11.500 3rd Qu.:85.00
## Max. :168.00 Max. :334.0 Max. :20.700 Max. :97.00
## NA's :37 NA's :7 NA's :7 NA's :5
Apparently Ozone is the variable with the most missing datapoints. Below we are going to dig deeper into the missing data patterns.
There are two types of missing data:
Assuming data is MCAR, too much missing data can be a problem too. Usually a safe maximum threshold is 5% of the total for large datasets. If missing data for a certain feature or sample is more than 5% then you probably should leave that feature or sample out. We therefore check for features (columns) and samples (rows) where more than 5% of the data is missing using a simple function:
pMiss <- function(x){sum(is.na(x))/length(x)*100}
apply(data,2,pMiss)
## Ozone Solar.R Wind Temp
## 24.183007 4.575163 4.575163 3.267974
apply(data,1,pMiss)
## [1] 25 25 25 50 100 50 25 25 25 50 25 0 0 0 0 0 0
## [18] 0 0 0 0 0 0 0 25 25 50 0 0 0 0 25 25 25
## [35] 25 25 25 0 25 0 0 25 25 0 25 25 0 0 0 0 0
## [52] 25 25 25 25 25 25 25 25 25 25 0 0 0 25 0 0 0
## [69] 0 0 0 25 0 0 25 0 0 0 0 0 0 0 25 25 0
## [86] 0 0 0 0 0 0 0 0 0 0 25 25 25 0 0 0 25
## [103] 25 0 0 0 25 0 0 0 0 0 0 0 25 0 0 0 25
## [120] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [137] 0 0 0 0 0 0 0 0 0 0 0 0 0 25 0 0 0
We see that Ozone is missing almost 25% of the datapoints, therefore we might consider either dropping it from the analysis or gather more measurements. The other variables are below the 5% threshold so we can keep them. As far as the samples are concerned, missing just one feature leads to a 25% missing data per sample. Samples that are missing 2 or more features (>50%), should be dropped if possible.
The mice package provides a helpful function md.pattern()
providing a better understanding of the pattern of missing data.
library(mice)
md.pattern(data)
## Temp Solar.R Wind Ozone
## 104 1 1 1 1 0
## 34 1 1 1 0 1
## 4 1 0 1 1 1
## 3 1 1 0 1 1
## 3 0 1 1 1 1
## 1 1 0 1 0 2
## 1 1 1 0 0 2
## 1 1 0 0 1 2
## 1 0 1 0 1 2
## 1 0 0 0 0 4
## 5 7 7 37 56
A perhaps more helpful visual representation can be obtained using the VIM package as follows:
library(VIM)
aggr_plot <- aggr(data, col=c('navyblue','red'), numbers=TRUE, sortVars=TRUE,
labels=names(data), cex.axis=.7, gap=3, ylab=c("Histogram of missing data","Pattern"))
##
## Variables sorted by number of missings:
## Variable Count
## Ozone 0.24183007
## Solar.R 0.04575163
## Wind 0.04575163
## Temp 0.03267974
The plot helps us understand:
Another visual approach is a special box plot:
marginplot(data[c(1,2)])
We are constrained at plotting 2 variables at a time only, but we can still collect some interesting insights.
The red box plot on the left shows the distribution of Solar.R with Ozone missing while the blue box plot shows the distribution of the remaining datapoints. Likewhise for the Ozone box plots at the bottom of the graph.
If our assumption of MCAR data is correct, then we expect the red and blue box plots to be very similar.
The mice() function takes care of the imputing process:
library(mice)
tempData <- mice(data,m=5,maxit=50,meth='pmm',seed=500)
summary(tempData)
## Multiply imputed data set
## Call:
## mice(data = data, m = 5, method = "pmm", maxit = 50, seed = 500)
## Number of multiple imputations: 5
## Missing cells per column:
## Ozone Solar.R Wind Temp
## 37 7 7 5
## Imputation methods:
## Ozone Solar.R Wind Temp
## "pmm" "pmm" "pmm" "pmm"
## VisitSequence:
## Ozone Solar.R Wind Temp
## 1 2 3 4
## PredictorMatrix:
## Ozone Solar.R Wind Temp
## Ozone 0 1 1 1
## Solar.R 1 0 1 1
## Wind 1 1 0 1
## Temp 1 1 1 0
## Random generator seed value: 500
A couple of notes on the parameters:
If you would like to check the imputed data, enter the following line of code:
head(tempData$imp$Ozone)
## 1 2 3 4 5
## 5 13 20 28 12 9
## 10 7 16 28 14 20
## 25 8 14 14 1 8
## 26 9 19 32 8 37
## 27 4 13 10 18 14
## 32 16 65 46 46 40
The output shows the imputed data for each observation (first column left) within each imputed dataset (first row at the top).
If you need to check the imputation method used for each variable, MICE makes it very easy to do:
tempData$meth
## Ozone Solar.R Wind Temp
## "pmm" "pmm" "pmm" "pmm"
Now we can get back the completed dataset using the complete()
function.
completedData <- complete(tempData,1)
The missing values have been replaced with the imputed values in the first of the five datasets. If you wish to use another one, just change the second parameter in the complete()
function.
Lets compare the distributions of original and imputed data using a some useful plots.
Use a scatterplot and plot Ozone against all the other variables:
library(lattice)
xyplot(tempData, Ozone ~ Wind + Temp + Solar.R, pch = 18, cex = 1)
We expect to see is that the shape of the magenta points (imputed) matches the shape of the blue ones (observed). The matching shape tells us that the imputed values are indeed “plausible values”.
Another helpful plot is the density plot:
densityplot(tempData)
The density of the imputed data for each imputed dataset is showed in magenta while the density of the observed data is showed in blue. Again, under our previous assumptions we expect the distributions to be similar.
Another useful visual take on the distributions can be obtained using the stripplot()
function that shows the distributions of the variables as individual points:
stripplot(tempData, pch = 20, cex = 1.2)
Fit a linear model to the data. The MICE package makes it easy to fit a a model to each of the imputed dataset and then pool the results together:
modelFit1 <- with(tempData, lm(Temp ~ Ozone + Solar.R +Wind))
summary(pool(modelFit1))
## est se t df Pr(>|t|)
## (Intercept) 72.83196838 2.984476953 24.403596 43.53891 0.000000e+00
## Ozone 0.16342203 0.024820514 6.584152 97.96154 2.282917e-09
## Solar.R 0.01040475 0.008775564 1.185651 18.75859 2.505626e-01
## Wind -0.33039675 0.218663873 -1.510980 75.90126 1.349458e-01
## lo 95 hi 95 nmis fmi lambda
## (Intercept) 66.815350668 78.84858610 NA 0.2698805 0.2370948
## Ozone 0.114166289 0.21267778 37 0.1203435 0.1025658
## Solar.R -0.007978725 0.02878823 7 0.4632472 0.4089161
## Wind -0.765912758 0.10511926 7 0.1662491 0.1445654
The variable modelFit1 containts the results of the fitting performed over the imputed datasets. The pool()
function pools them all together. The Ozone variable is statistically significant. Note that there are other columns aside from those typical of the lm()
model: fmi
contains the fraction of missing information while lambda
is the proportion of total variance that is attributable to the missing data.
We initialized the mice function with a specific seed, therefore the results are somewhat dependent on our initial choice. To reduce this effect, we can impute a higher number of dataset, by changing the default m = 5 parameter in the MICE() function.
tempData2 <- mice(data, m = 50)
modelFit2 <- with(tempData2, lm(Temp ~ Ozone + Solar.R + Wind))
summary(pool(modelFit2))
## est se t df Pr(>|t|)
## (Intercept) 72.77049386 2.845156817 25.576971 115.5163 0.000000e+00
## Ozone 0.16142418 0.026183862 6.165025 101.3380 1.448599e-08
## Solar.R 0.01226693 0.007185263 1.707235 120.1669 9.036100e-02
## Wind -0.35035345 0.214336833 -1.634593 123.3348 1.046843e-01
## lo 95 hi 95 nmis fmi lambda
## (Intercept) 67.135053603 78.40593412 NA 0.1764771 0.1623413
## Ozone 0.109484536 0.21336382 37 0.2426855 0.2278853
## Solar.R -0.001959189 0.02649305 7 0.1548696 0.1409198
## Wind -0.774608631 0.07390173 7 0.1400233 0.1261900
We about the same results as before with only Ozone showing statistical significance.