Impute Missing Data - The Basics

Brief Introduction to Imputing Missing Data

This is just a quick document; a more thorough one will be developed. Sometimes a short, quick-read doc is all you need to get through a problem you are researching.

Again, make some data.

# Make up some data (arrange() and desc() come from dplyr)
library(dplyr)
income <- round(runif(100, min = 35000, max = 350000), 0)
age <- round(runif(100, min=18, max=72), 0)
myData <- data.frame(age, income)
noise <- round(runif(100, min = 1500, max = 15000), 0)
myData$income <- myData$income + noise
myData <- arrange(myData, desc(income))
myData$education <- as.factor(sample(c("High School", "Bachelors", "Masters", "Doctorate"), 100, replace = TRUE, prob =c(0.7, 0.15, 0.12, 0.03) ))
head(myData, 5)

##   age income   education
## 1  64 360674 High School
## 2  33 357498 High School
## 3  51 356911 High School
## 4  60 353922 High School
## 5  55 350325 High School

#add some missing data this time
myData$age[sample(1:nrow(myData),15)] <- NA
myData$income[sample(1:nrow(myData),10)] <- NA
myData$education[sample(1:nrow(myData),10)] <- NA
summary(myData)

##       age            income             education 
##  Min.   :18.00   Min.   : 45841   Bachelors  :11  
##  1st Qu.:28.00   1st Qu.:130682   Doctorate  : 4  
##  Median :45.00   Median :216022   High School:64  
##  Mean   :43.75   Mean   :212358   Masters    :11  
##  3rd Qu.:57.00   3rd Qu.:297378   NA's       :10  
##  Max.   :71.00   Max.   :360674                   
##  NA's   :15      NA's   :10
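Before imputing, it can help to count the gaps directly rather than reading them off summary(). The following is a minimal sketch in base R; the `toy` data frame and its values are made up for illustration and only stand in for myData.

```r
# Toy data frame standing in for myData (values are invented for illustration)
toy <- data.frame(
  age = c(25, NA, 40, NA),
  income = c(50000, 62000, NA, NA),
  education = factor(c("Bachelors", NA, "Masters", "High School"))
)

# Number of missing values in each column
naPerColumn <- colSums(is.na(toy))
print(naPerColumn)

# Number of rows with no missing values at all
completeRows <- sum(complete.cases(toy))
print(completeRows)
```

The same two calls should work unchanged on myData.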

Impute missing values with median/mode

Use impute() from the imputeMissings package to impute missing values with the median/mode. This method is simple and fast, but it treats each predictor independently, so it ignores relationships between variables and the imputed values can be inaccurate.

library(imputeMissings)
myDataImputed1 <- impute(myData, method = "median/mode")
summary(myDataImputed1)

##       age            income             education 
##  Min.   :18.00   Min.   : 45841   Bachelors  :11  
##  1st Qu.:30.50   1st Qu.:141858   Doctorate  : 4  
##  Median :45.00   Median :216022   High School:74  
##  Mean   :43.94   Mean   :212724   Masters    :11  
##  3rd Qu.:55.00   3rd Qu.:290346                   
##  Max.   :71.00   Max.   :360674

The median/mode method imputes the mode for character vectors and factors and the median for numeric and integer vectors. You can see that the 10 missing values for the variable “education” are imputed with “High School” since it is the mode.
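To make the rule concrete, here is a rough base R sketch of median/mode imputation. It is not the imputeMissings implementation; `modeValue` and `imputeMedianMode` are hypothetical helpers written for this example, and the `toy` data is made up.

```r
# Hypothetical helper: most frequent non-missing value (first one wins on ties)
modeValue <- function(x) {
  observed <- x[!is.na(x)]
  uniqueValues <- unique(observed)
  uniqueValues[which.max(tabulate(match(observed, uniqueValues)))]
}

# Hypothetical helper: median for numeric columns, mode for everything else
imputeMedianMode <- function(df) {
  for (col in names(df)) {
    missing <- is.na(df[[col]])
    if (!any(missing)) next
    if (is.numeric(df[[col]])) {
      df[[col]][missing] <- median(df[[col]], na.rm = TRUE)
    } else {
      df[[col]][missing] <- modeValue(df[[col]])
    }
  }
  df
}

# Made-up data with one gap in each column
toy <- data.frame(
  age = c(20, 30, NA, 50),
  education = factor(c("High School", NA, "High School", "Masters"))
)
toyImputed <- imputeMedianMode(toy)
print(toyImputed)
```

Each column is filled using only its own observed values, which is why this approach cannot exploit relationships between predictors.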

You can also use preProcess() from the caret package, but its median imputation only works for numeric variables.

library(caret)
myDataImputed2 <- preProcess(myData[, c("income", "age")], method = "medianImpute")
myDataImputed2 <- predict(myDataImputed2, myData[, c("income", "age")])
summary(myDataImputed2)

##      income            age       
##  Min.   : 45841   Min.   :18.00  
##  1st Qu.:141858   1st Qu.:30.50  
##  Median :216022   Median :45.00  
##  Mean   :212724   Mean   :43.94  
##  3rd Qu.:290346   3rd Qu.:55.00  
##  Max.   :360674   Max.   :71.00

Impute missing values based on K-nearest neighbors

K-nearest neighbors imputation finds the k closest samples in the training set and imputes the mean of those neighbors' values.

This method considers all predictors together. It requires them to be on the same scale because Euclidean distance is used to find the neighbors.
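The idea can be sketched in base R for a single row. This is a simplified illustration, not caret's implementation: `knnImputeRow` is a hypothetical helper, the `toy` values are made up, and distances are computed only on the columns the target row actually has.

```r
# Made-up data; row 5 is missing income
toy <- data.frame(
  age    = c(25, 30, 45, 60, 31),
  income = c(40000, 52000, 90000, 120000, NA)
)

# Standardize each column (scale() ignores NA when computing mean and sd)
scaledMatrix <- scale(toy)
centers <- attr(scaledMatrix, "scaled:center")
spreads <- attr(scaledMatrix, "scaled:scale")
scaled <- as.data.frame(scaledMatrix)

# Hypothetical helper: fill one row from the mean of its k nearest complete rows
knnImputeRow <- function(scaledData, rowIndex, k = 2) {
  target <- unlist(scaledData[rowIndex, ])
  observed <- !is.na(target)
  donors <- scaledData[complete.cases(scaledData), ]
  # Euclidean distance on the columns observed in the target row
  diffs <- sweep(as.matrix(donors[, observed, drop = FALSE]), 2, target[observed])
  distances <- sqrt(rowSums(diffs^2))
  nearest <- order(distances)[seq_len(k)]
  target[!observed] <- colMeans(donors[nearest, !observed, drop = FALSE])
  target
}

imputedRow <- knnImputeRow(scaled, rowIndex = 5, k = 2)

# Convert the imputed income back to the original units
imputedIncome <- imputedRow[["income"]] * spreads[["income"]] + centers[["income"]]
print(imputedIncome)
```

Here the two nearest neighbors by age are the 30- and 25-year-olds, so the imputed income is the mean of their incomes. Without standardizing, a distance that included income would be dominated by it simply because its numbers are so much larger than ages.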

myDataImputed3 <- preProcess(myData[, c("income", "age")], method = "knnImpute", k=2)
myDataImputed3 <- predict(myDataImputed3, myData[, c("income", "age")])

Error in FUN(newX[, i], …) : cannot impute when all predictors are missing in the new data point

The error says that imputation is impossible when all predictors are missing in a data point. It occurs because at least one sample has both “income” and “age” missing, so there is nothing to measure a distance from. We can drop these rows and try again.

# Logical mask of rows where both predictors are missing
myBadDataRows <- is.na(myData$income) & is.na(myData$age)

myDataImputed3 <- preProcess(myData[!myBadDataRows, c("income", "age")], method = "knnImpute", k = 2)
myDataImputed3 <- predict(myDataImputed3, myData[!myBadDataRows, c("income", "age")])
summary(myDataImputed3)

##      income             age          
##  Min.   :-1.6977   Min.   :-1.54982  
##  1st Qu.:-0.7311   1st Qu.:-0.88783  
##  Median : 0.1676   Median : 0.07505  
##  Mean   : 0.0371   Mean   : 0.02733  
##  3rd Qu.: 0.8213   3rd Qu.: 0.76712  
##  Max.   : 1.5121   Max.   : 1.63973

The error doesn’t show up this time. Note that the summary is now on a standardized scale: knnImpute centers and scales the predictors automatically, because Euclidean distance is only meaningful for finding neighbours when the predictors share a scale.
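Dropping rows in which every predictor is missing can be generalized to any number of predictors. The sketch below uses made-up data and a hypothetical `predictors` vector; it builds a logical mask rather than negative indices, since negative indexing with an empty which() result would select zero rows when no row is affected.

```r
# Made-up data; row 4 has no observed predictor
toy <- data.frame(
  age = c(25, NA, 40, NA),
  income = c(50000, 62000, NA, NA)
)

# Columns that will be passed to the imputation step (assumed for this example)
predictors <- c("age", "income")

# TRUE for rows where every predictor is missing
allMissing <- rowSums(is.na(toy[, predictors, drop = FALSE])) == length(predictors)

# Keep only rows with at least one observed predictor
usable <- toy[!allMissing, predictors, drop = FALSE]
print(usable)
```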