This is just a quick document; a more thorough one will follow. Sometimes a short, quick-read doc is all you need to get through a problem you are researching.
Again, make some data.
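#Load the packages used in this document
library(dplyr)          # arrange(), desc()
library(imputeMissings) # impute()
library(caret)          # preProcess()
#Make up some data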
income <- round(runif(100, min = 35000, max = 350000), 0)
age <- round(runif(100, min=18, max=72), 0)
myData <- data.frame(age, income)
noise <- round(runif(100, min = 1500, max = 15000), 0)
myData$income <- myData$income + noise
myData <- arrange(myData, desc(income))
myData$education <- as.factor(sample(
  c("High School", "Bachelors", "Masters", "Doctorate"),
  100, replace = TRUE, prob = c(0.7, 0.15, 0.12, 0.03)
))
head(myData, 5)
##   age income   education
## 1  64 360674 High School
## 2  33 357498 High School
## 3  51 356911 High School
## 4  60 353922 High School
## 5  55 350325 High School
#add some missing data this time
myData$age[sample(1:nrow(myData),15)] <- NA
myData$income[sample(1:nrow(myData),10)] <- NA
myData$education[sample(1:nrow(myData),10)] <- NA
summary(myData)
##       age            income            education
##  Min.   :18.00   Min.   : 45841   Bachelors  :11
##  1st Qu.:28.00   1st Qu.:130682   Doctorate  : 4
##  Median :45.00   Median :216022   High School:64
##  Mean   :43.75   Mean   :212358   Masters    :11
##  3rd Qu.:57.00   3rd Qu.:297378   NA's       :10
##  Max.   :71.00   Max.   :360674
##  NA's   :15      NA's   :10
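Because the three sample() calls pick rows independently, some rows can end up missing in more than one column. A minimal base R sketch for checking this is shown here on a small made-up data frame (`toy` and the other variable names are illustrative and not part of the original analysis):

```r
# Toy data frame (hypothetical values) with overlapping missingness
toy <- data.frame(
  age    = c(25, NA, 40, NA),
  income = c(50000, 60000, NA, NA)
)

naPerColumn <- colSums(is.na(toy))              # missing values in each column
naPerRow <- rowSums(is.na(toy))                 # missing values in each row
allMissingRows <- which(naPerRow == ncol(toy))  # rows with no observed value at all

naPerColumn
allMissingRows
```

Rows flagged in `allMissingRows` carry no information that an imputation method based on the other predictors could use.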
Use impute() from the imputeMissings package to impute missing values with the median/mode. This method is simple and fast, but it treats each predictor independently, so the imputed values may not be accurate.
myDataImputed1 <- impute(myData, method = "median/mode")
summary(myDataImputed1)
##       age            income            education
##  Min.   :18.00   Min.   : 45841   Bachelors  :11
##  1st Qu.:30.50   1st Qu.:141858   Doctorate  : 4
##  Median :45.00   Median :216022   High School:74
##  Mean   :43.94   Mean   :212724   Masters    :11
##  3rd Qu.:55.00   3rd Qu.:290346
##  Max.   :71.00   Max.   :360674
The median/mode method imputes the mode for character and factor variables and the median for numeric and integer variables. You can see that the 10 missing values for the variable “education” are imputed with “High School”, since it is the mode (the count rises from 64 to 74).
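To make the idea concrete, here is a minimal base R sketch of median/mode imputation. It is an illustration of the technique, not the imputeMissings implementation; the helper name `impute_median_mode` and the toy data are assumptions made up for this example.

```r
# Hypothetical helper: median for numeric columns, mode for everything else
impute_median_mode <- function(df) {
  for (col in names(df)) {
    x <- df[[col]]
    if (!anyNA(x)) next                  # nothing to impute in this column
    if (is.numeric(x)) {
      fill <- median(x, na.rm = TRUE)
    } else {
      counts <- table(x)                 # table() ignores NA by default
      fill <- names(counts)[which.max(counts)]  # first level wins on ties
    }
    x[is.na(x)] <- fill
    df[[col]] <- x
  }
  df
}

# Toy data (hypothetical values)
toy <- data.frame(
  age = c(20, NA, 40, 60),
  education = factor(c("High School", "High School", NA, "Masters"))
)
toyImputed <- impute_median_mode(toy)
toyImputed
```

Each column is filled using only its own observed values, which is why this approach is fast but ignores any relationship between predictors.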
You can also use preProcess() from the caret package, but it only works for numeric variables.
myDataImputed2 <- preProcess(myData[, c("income", "age")], method = "medianImpute")
myDataImputed2 <- predict(myDataImputed2, myData[, c("income", "age")])
summary(myDataImputed2)
##      income            age
##  Min.   : 45841   Min.   :18.00
##  1st Qu.:141858   1st Qu.:30.50
##  Median :216022   Median :45.00
##  Mean   :212724   Mean   :43.94
##  3rd Qu.:290346   3rd Qu.:55.00
##  Max.   :360674   Max.   :71.00
k-nearest neighbor imputation finds the k closest samples in the training set and imputes the mean of those neighbors.
This method considers all predictors together. It requires them to be on the same scale, since Euclidean distance is used.
myDataImputed3 <- preProcess(myData[, c("income", "age")], method = "knnImpute", k=2)
myDataImputed3 <- predict(myDataImputed3, myData[, c("income", "age")])
Error in FUN(newX[, i], …) : cannot impute when all predictors are missing in the new data point
The error says that caret cannot impute when all predictors are missing in the new data point. This happens because at least one sample has both “income” and “age” missing, so there is nothing left to measure its distance to other samples. We can drop these rows and try again.
# Logical mask of rows where both predictors are missing (safe even if no row matches)
myBadDataRows <- is.na(myData$income) & is.na(myData$age)
myDataImputed3 <- preProcess(myData[!myBadDataRows, c("income", "age")], method = "knnImpute", k = 2)
myDataImputed3 <- predict(myDataImputed3, myData[!myBadDataRows, c("income", "age")])
summary(myDataImputed3)
##      income             age
##  Min.   :-1.6977   Min.   :-1.54982
##  1st Qu.:-0.7311   1st Qu.:-0.88783
##  Median : 0.1676   Median : 0.07505
##  Mean   : 0.0371   Mean   : 0.02733
##  3rd Qu.: 0.8213   3rd Qu.: 0.76712
##  Max.   : 1.5121   Max.   : 1.63973
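The error doesn’t show up this time. Note that the summary now shows values around zero rather than raw incomes and ages: because Euclidean distance is used to find the neighbours, preProcess() centers and scales the predictors automatically when method = "knnImpute" is requested, and the imputed data are returned on that standardized scale.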
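To show what happens under the hood, here is a minimal base R sketch of k-nearest neighbor imputation. It is a simplified illustration and not caret's implementation: the helper name `knn_impute` and the toy data are made up for this example, only complete rows are used as neighbors, and, unlike the caret output above, the result is returned on the original scale.

```r
# Hypothetical helper: kNN imputation for a numeric data frame
knn_impute <- function(df, k = 2) {
  mat <- as.matrix(df)
  scaled <- scale(mat)                        # center and scale; NAs are ignored
  donors <- which(complete.cases(scaled))     # rows with no missing values
  out <- mat
  for (i in which(!complete.cases(scaled))) {
    observed <- !is.na(scaled[i, ])
    if (!any(observed)) next                  # all predictors missing: leave as NA
    # Euclidean distance to each donor, using only the observed columns
    diffs <- sweep(scaled[donors, observed, drop = FALSE], 2, scaled[i, observed])
    d <- sqrt(rowSums(diffs^2))
    nearest <- donors[order(d)[seq_len(min(k, length(d)))]]
    # Impute the mean of the k nearest donors (original scale)
    out[i, !observed] <- colMeans(mat[nearest, !observed, drop = FALSE])
  }
  as.data.frame(out)
}

# Toy data (hypothetical values): the last row is missing income
toy <- data.frame(
  age    = c(20, 22, 60, 62, 21),
  income = c(40000, 42000, 200000, 210000, NA)
)
toyImputed <- knn_impute(toy, k = 2)
toyImputed
```

The 21-year-old's two nearest neighbors by age are the 20- and 22-year-olds, so the missing income is filled with the mean of their incomes. A row with every predictor missing is skipped, which mirrors the situation that triggered the caret error.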