Preprocessign Data with caret

Introduction
Pre-Processing
Appendix
References

Introduction

caret, short for _C_lassification _A_nd _RE_gression _T_raining, is a set of functions that streamline the process for creating predictive models. Its two chief benefits are a uniform interface and standardization of common tasks. The package contains tools for:

data splitting
pre-processing
feature selection
model tuning using resampling
variable importance estimation

Caret Package Cheatsheet

This document provides examples of caret preprocessing features.

Pre-Processing

caret (Classification And Regression Training ) includes several functions to pre-process the predictor data. caretassumes that all of the data are numeric (i.e. factors have been converted to dummy variables via model.matrix, dummyVars or other means).

Data Splitting
Dummy Variables
Zero- and Near Zero-Variance Predictors
Identifying Correlated Predictors
Linear Dependencies
Centering and Scaling
Imputation
Transforming Predictors
Class Distance Calculations

You may want to use parallel processing

Setup parallel processing (assuming 6 cores in the example):

library(doParallel) rCluster <- makePSOCKcluster(6) registerDoParallel(rCluster)

Split Data

Simple Splitting Based on the Outcome

createDataPartition is used to create balanced splits of the data. If the y argument to this function is a factor, the random sampling occurs within each class and should preserve the overall class distribution of the data.

For example, to create a single 80/20% split of the iris data.

list = FALSE avoids returns the data as a list
times creates multiple splits at once
createResample is used to make simple bootstrap samples and createFolds to generate balanced cross–validation groupings from a set of data
createFolds splits the data into k groups
createTimeSlices creates cross-validation split for series data
groupKFold splits the data based on a grouping factor

trainIndex <- createDataPartition(iris$Species, p = .8, list = FALSE, times = 1)
trainIndex5 <- createDataPartition(iris$Species, p = .8, list = FALSE, times = 5)
head(trainIndex)

##      Resample1
## [1,]         2
## [2,]         3
## [3,]         4
## [4,]         6
## [5,]         8
## [6,]         9

head(trainIndex5)

##      Resample1 Resample2 Resample3 Resample4 Resample5
## [1,]         1         1         1         2         1
## [2,]         2         2         2         3         2
## [3,]         3         4         3         4         3
## [4,]         4         5         4         5         4
## [5,]         5         6         5         7         5
## [6,]         6         7         6         9         7

irisTrain <- iris[ trainIndex,]
irisTest  <- iris[-trainIndex,]

A 2nd example to drive the concept home.

TRAIN.PERCENT <- 0.75 
inTrainSetIndex <- createDataPartition(y = rawTrain$activity, p=TRAIN.PERCENT, list=FALSE)
training   <- rawTrain[ inTrainSetIndex, ]
validation <- rawTrain[-inTrainSetIndex, ]

Splitting Based on Predictors

maxDissim is used to create sub–samples using a maximum dissimilarity approach. This is particularly useful for unsupervised learning where there are no response variables.

Suppose data set A with m samples and a larger data set B with n samples. We may want to create a sub–sample from B that is diverse when compared to A. For each sample in B, the function calculates the m dissimilarities between each point in A. The most dissimilar point in B is added to A and the process continues.

There are many methods in R to calculate dissimilarity. caret uses proxy. caret includes two functions, minDiss and sumDiss that can be used to maximize the minimum and total dissimilarities.

As an example, the figure below shows a scatter plot of two chemical descriptors for the Cox2 data. Using an initial random sample of 5 compounds, we can select 20 more compounds from the data so that the new compounds are most dissimilar from the initial 5 that were specified. The panels in the figure show the results using several combinations of distance metrics and scoring functions. For these data, the distance measure has less of an impact than the scoring method for determining which compounds are most dissimilar.

data(BostonHousing)

#Function modified from carate help file
caretDiss <- function(pct = 1, obj = minDiss, ...){
################### Change ###################   
     tmp <- scale(BostonHousing[, c("age", "nox")])
     
     ## start with 10 data points
     start <- sample(1:dim(tmp)[1], 10)
     base <- tmp[start,]
     pool <- tmp[-start,]
     
     ## select 20 for addition
     newSamp <- maxDissim(base, pool, n = 20, randomFrac = pct, obj = obj, ...)
################### End Change ################### 
     
     allSamp <- c(start, newSamp)
     plot(
          tmp[-newSamp,], 
          xlim = extendrange(tmp[,1]), ylim = extendrange(tmp[,2]), 
          col = "darkgrey", 
          xlab = "variable 1", ylab = "variable 2")
     points(base, pch = 16, cex = .7)
     
     for(i in seq(along = newSamp))
          points(
               pool[newSamp[i],1], 
               pool[newSamp[i],2], 
               pch = paste(i), col = "darkred")}

par(mfrow=c(2,2))

caretDiss(1, minDiss)
title("No Random Sampling, Min Score")
caretDiss(.1, minDiss)
title("10 Pct Random Sampling, Min Score")
caretDiss(1, sumDiss)
title("No Random Sampling, Sum Score")
caretDiss(.1, sumDiss)
title("10 Pct Random Sampling, Sum Score")

Time Series Data Splitting

Simple random sampling of time series is a poor way to resample times series data. caret contains a function called createTimeSlices that creates the indices for this type of splitting. Typically used with trControl.

The three parameters for this type of splitting are:

initialWindow: the initial number of consecutive values in each training set sample
horizon: The number of consecutive values in test set sample
fixedWindow: A logical: if FALSE, the training set always start at the first sample and the training set size will vary over data splits.

data(economics)#ggplot2
head(economics)

## # A tibble: 6 x 6
##         date   pce    pop psavert uempmed unemploy
##       <date> <dbl>  <int>   <dbl>   <dbl>    <int>
## 1 1967-07-01 507.4 198712    12.5     4.5     2944
## 2 1967-08-01 510.5 198911    12.5     4.7     2945
## 3 1967-09-01 516.3 199113    11.7     4.6     2958
## 4 1967-10-01 512.9 199311    12.5     4.9     3143
## 5 1967-11-01 518.1 199498    12.5     4.7     3066
## 6 1967-12-01 525.8 199657    12.1     4.8     3018

myTimeControl <- trainControl(method = "timeslice", initialWindow = 36, horizon = 12, fixedWindow = TRUE)

plsFitTime <- train(unemploy ~ pce + pop + psavert, data = economics, method = "pls",
                    preProc = c("center", "scale"), trControl = myTimeControl)

A better example can be found here: https://r-norberg.blogspot.com/2016/08/data-splitting-time-slices-with.html. It include a function for irregular timeslices!

Splitting with Important Groups

In some cases there is an important qualitative factor in the data that should be considered during (re)sampling. In clinical trials there may be hospital-to-hospital differences with repeated measures data or subjects may have multiple rows in the data set. You want to ensure these groups are not contained in the training and testing set since this may bias the test set performance to be more optimistic.

To split the data base by groups, groupKFold can be used. (This was added to caret in version 6.0-76.)

subjects <- sample(1:20, size = 80, replace = TRUE)
table(subjects)

## subjects
##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 
##  5  4  1  2  2  2  5  5  3  6  5  7  3  6  3  2  4  7  3  5

folds <- groupKFold(subjects, k = 15) 
folds[1:2]

## $Fold01
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 14 15 16 17 18 19 20 21 22 23 24
## [24] 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
## [47] 48 49 50 51 52 53 54 55 56 58 59 60 61 62 63 64 65 66 67 69 70 71 72
## [70] 73 74 75 76 77 78 79 80
## 
## $Fold02
##  [1]  1  2  3  4  5  6  8  9 10 11 12 13 14 15 17 18 19 20 21 22 23 24 25
## [24] 26 28 29 30 31 32 33 34 35 36 37 38 40 41 42 43 44 45 46 47 48 49 50
## [47] 51 52 53 54 55 56 57 58 59 60 61 62 64 65 66 67 68 69 70 71 72 73 74
## [70] 75 77 79 80

The results in folds can be used as inputs into the index argument of trainControl.

Dummy Variables

dummyVars can be used to generate a complete (less than full rank parameterized) set of dummy variables from one or more factors. The function takes a formula and a data set and outputs an object that can be used to create the dummy variables using the predict method.

For example, the etitanic data set in earth includes two factors: pclass1 (with levels 1st, 2nd, 3rd) and sex (with levels female, male). model.matrix from stats generates the following variables:

data(etitanic)#earth
head(model.matrix(survived ~ ., data = etitanic))#stats

##   (Intercept) pclass2nd pclass3rd sexmale     age sibsp parch
## 1           1         0         0       0 29.0000     0     0
## 2           1         0         0       1  0.9167     1     2
## 3           1         0         0       0  2.0000     1     2
## 4           1         0         0       1 30.0000     1     2
## 5           1         0         0       0 25.0000     1     2
## 6           1         0         0       1 48.0000     0     0

Using dummyVars:

dummies <- dummyVars(survived ~ ., data = etitanic)
head(predict(dummies, newdata = etitanic))

##   pclass.1st pclass.2nd pclass.3rd sex.female sex.male     age sibsp parch
## 1          1          0          0          1        0 29.0000     0     0
## 2          1          0          0          0        1  0.9167     1     2
## 3          1          0          0          1        0  2.0000     1     2
## 4          1          0          0          0        1 30.0000     1     2
## 5          1          0          0          1        0 25.0000     1     2
## 6          1          0          0          0        1 48.0000     0     0

Note there is no intercept and each factor has a dummy variable for each level so this parameterization may not be useful for some model functions such as lm.

Zero and Near-Zero-Variance Predictors

In some situations, the data generating mechanism can create predictors that only have a single unique value (i.e. a “zero-variance predictor”). For many models (excluding tree-based models), this may cause the model to crash or the fit to be unstable.

Similarly, predictors might have only a handful of unique values that occur with very low frequencies. For example, in the drug resistance data, the nR11 descriptor (number of 11-membered rings) data have a few unique numeric values that are highly unbalanced:

data(mdrr)
data.frame(table(mdrrDescr$nR11))

##   Var1 Freq
## 1    0  501
## 2    1    4
## 3    2   23

The concern here that these predictors may become zero-variance predictors when the data are split into cross-validation/bootstrap sub-samples or that a few samples may have an undue influence on the model. These “near-zero-variance” predictors may need to be identified and eliminated prior to modeling.

To identify these types of predictors, the following two metrics can be calculated: * the frequency of the most prevalent value over the second most frequent value (called the “frequency ratio’’), which would be near one for well-behaved predictors and very large for highly-unbalanced data> * the”percent of unique values’’ is the number of unique values divided by the total number of samples (times 100) that approaches zero as the granularity of the data increases>

If the frequency ratio is less than a pre-specified threshold and the unique value percentage is less than a threshold, we might consider a predictor to be near zero-variance.

We would not want to falsely identify data that have low granularity but are evenly distributed, such as data from a discrete uniform distribution. Using both criteria should not falsely detect such predictors.

Looking at the MDRR data, the nearZeroVar function can be used to identify near zero-variance variables (the saveMetrics argument can be used to show the details and usually defaults to FALSE):

nzv <- nearZeroVar(mdrrDescr, saveMetrics= TRUE)
nzv[nzv$nzv,][1:10,]

##        freqRatio percentUnique zeroVar  nzv
## nTB     23.00000     0.3787879   FALSE TRUE
## nBR    131.00000     0.3787879   FALSE TRUE
## nI     527.00000     0.3787879   FALSE TRUE
## nR03   527.00000     0.3787879   FALSE TRUE
## nR08   527.00000     0.3787879   FALSE TRUE
## nR11    21.78261     0.5681818   FALSE TRUE
## nR12    57.66667     0.3787879   FALSE TRUE
## D.Dr03 527.00000     0.3787879   FALSE TRUE
## D.Dr07 123.50000     5.8712121   FALSE TRUE
## D.Dr08 527.00000     0.3787879   FALSE TRUE

dim(mdrrDescr)

## [1] 528 342

nzv <- nearZeroVar(mdrrDescr)
filteredDescr <- mdrrDescr[, -nzv]
dim(filteredDescr)

## [1] 528 297

By default, nearZeroVar will return the positions of the variables that are flagged to be problematic.

Here is another example to drive the concept home.

dim(rawTrain)

## [1] 7352  562

#Be patient - lots of data!
nzv <- nearZeroVar(rawTrain, saveMetrics=TRUE) 
count_nzv <- sum(nzv$nzv) 
count_nzv #since thae data is high precision numerical data, everything is unique!

## [1] 0

if (count_nzv > 0){
     rawTrain <- rawTrain[, !nzv$nzv] 
     finalTest <- finalTest[, !nzv$nzv]
     }

Correlated Predictors

While there are some models that thrive on correlated predictors (such as pls), other models benefit from reducing or removing the level of correlation between the predictors.

Given a correlation matrix, findCorrelation uses the following algorithm to flag predictors for removal:

descrCor <-  cor(filteredDescr)
highCorr <- sum(abs(descrCor[upper.tri(descrCor)]) > .999)

For the previous MDRR data, there are 65 descriptors that are almost perfectly correlated (|correlation| > 0.999) such as the total information index of atomic composition (IAC) and the total information content index (neighborhood symmetry of 0-order) (TIC0) (correlation = 1).

cor(filteredDescr$IAC, filteredDescr$TIC0)

## [1] 1

The code chunk below shows the effect of removing descriptors with absolute correlations above 0.75.

summary(descrCor[upper.tri(descrCor)])

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -0.99607 -0.05373  0.25006  0.26078  0.65527  1.00000

highlyCorDescr <- findCorrelation(descrCor, cutoff = .75)
filteredDescr <- filteredDescr[,-highlyCorDescr]
descrCor2 <- cor(filteredDescr)
summary(descrCor2[upper.tri(descrCor2)])

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -0.70728 -0.05378  0.04418  0.06692  0.18858  0.74458

Here is another example to help you remember this important function.

CUTOFF <- 0.90 
cor_matrix <- cor(rawTrain[,-1]) 
cor_high <- findCorrelation(cor_matrix, CUTOFF)
high_cor_remove <- row.names(cor_matrix)[cor_high] 
head(high_cor_remove, 8)

## [1] "v005.tBodyAcc.stdE.Y"    "v008.tBodyAcc.madE.Y"   
## [3] "v013.tBodyAcc.minE.X"    "v016.tBodyAcc.smaE"     
## [5] "v020.tBodyAcc.iqrE.X"    "v021.tBodyAcc.iqrE.Y"   
## [7] "v050.tGravityAcc.maxE.X" "v054.tGravityAcc.minE.Y"

length(high_cor_remove)

## [1] 347

rawTrain <- rawTrain[,   -cor_high] 
finalTest <- finalTest[, -cor_high]

Linear Dependencies

The function findLinearCombos uses the QR decomposition of a matrix to enumerate sets of linear combinations (if they exist). For example, consider the following matrix that is could have been produced by a less-than-full-rank parameterizations of a two-way experimental layout:

ltfrDesign <- matrix(0, nrow=6, ncol=6)
ltfrDesign[,1] <- c(1, 1, 1, 1, 1, 1)
ltfrDesign[,2] <- c(1, 1, 1, 0, 0, 0)
ltfrDesign[,3] <- c(0, 0, 0, 1, 1, 1)
ltfrDesign[,4] <- c(1, 0, 0, 1, 0, 0)
ltfrDesign[,5] <- c(0, 1, 0, 0, 1, 0)
ltfrDesign[,6] <- c(0, 0, 1, 0, 0, 1)
ltfrDesign

##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    1    1    0    1    0    0
## [2,]    1    1    0    0    1    0
## [3,]    1    1    0    0    0    1
## [4,]    1    0    1    1    0    0
## [5,]    1    0    1    0    1    0
## [6,]    1    0    1    0    0    1

Note that columns two and three add up to the first column. Similarly, columns four, five and six add up the first column. findLinearCombos will return a list that enumerates these dependencies. This is often not easy to find in larger data sets! For each linear combination, it will incrementally remove columns from the matrix and test to see if the dependencies have been resolved. findLinearCombos will also return a vector of column positions can be removed to eliminate the linear dependencies:

comboInfo <- findLinearCombos(ltfrDesign)
comboInfo

## $linearCombos
## $linearCombos[[1]]
## [1] 3 1 2
## 
## $linearCombos[[2]]
## [1] 6 1 4 5
## 
## 
## $remove
## [1] 3 6

ltfrDesign[, -comboInfo$remove]

##      [,1] [,2] [,3] [,4]
## [1,]    1    1    1    0
## [2,]    1    1    0    1
## [3,]    1    1    0    0
## [4,]    1    0    1    0
## [5,]    1    0    0    1
## [6,]    1    0    0    0

Centering and Scaling

The preProcess class can be used for many operations on predictors, including centering and scaling. preProcess estimates the required parameters for each operation and predict.preProcess is used to apply them to specific data sets.

In the example below, half the MDRR data are used to estimate the location and scale of the predictors. preProcess doesn’t actually pre-process the data. predict.preProcess is used to pre-process this and other data sets.

inTrain <- sample(seq(along = mdrrClass), length(mdrrClass)/2)

training <- filteredDescr[inTrain,]
glimpse(training)

## Observations: 264
## Variables: 50
## $ AMW      <dbl> 6.38, 7.11, 7.42, 6.64, 7.07, 6.80, 10.76, 8.14, 5.99...
## $ Mp       <dbl> 0.65, 0.62, 0.67, 0.61, 0.66, 0.63, 0.73, 0.70, 0.64,...
## $ Ms       <dbl> 1.96, 2.55, 2.10, 2.25, 2.10, 2.17, 2.06, 2.47, 1.84,...
## $ nDB      <dbl> 0, 2, 1, 0, 2, 1, 1, 2, 1, 0, 1, 1, 2, 0, 1, 1, 0, 0,...
## $ nAB      <dbl> 12, 12, 20, 12, 13, 13, 16, 12, 6, 12, 6, 21, 6, 12, ...
## $ nS       <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ nF       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ nCL      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,...
## $ nR05     <dbl> 0, 0, 1, 0, 0, 1, 1, 1, 2, 0, 0, 1, 0, 1, 0, 0, 0, 0,...
## $ nR06     <dbl> 2, 2, 4, 4, 4, 2, 2, 2, 3, 3, 2, 3, 3, 2, 3, 2, 4, 4,...
## $ nR07     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ nR09     <dbl> 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,...
## $ nR10     <dbl> 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 1,...
## $ nBnz     <dbl> 2, 1, 2, 0, 2, 2, 2, 2, 1, 1, 1, 3, 0, 2, 2, 2, 2, 0,...
## $ HNar     <dbl> 1.932, 1.765, 2.090, 1.964, 2.000, 1.842, 1.842, 1.98...
## $ Xt       <dbl> 0.001, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.00...
## $ SPI      <dbl> 27.295, 15715.120, 38.197, 335.058, 113.600, 1670.555...
## $ Jhetm    <dbl> 2.313, 2.728, 1.925, 2.220, 1.363, 1.609, 2.192, 2.30...
## $ MAXDN    <dbl> 1.317, 2.524, 2.074, 1.577, 1.984, 1.607, 1.651, 2.38...
## $ MAXDP    <dbl> 2.583, 6.147, 4.879, 3.644, 5.635, 5.757, 6.588, 5.31...
## $ TIE      <dbl> 182.262, 617.140, 139.590, 226.878, 159.073, 270.378,...
## $ X5v      <dbl> 1.343, 2.451, 2.905, 3.552, 3.852, 2.637, 3.083, 1.56...
## $ BLI      <dbl> 0.995, 0.978, 0.909, 1.018, 1.077, 0.949, 1.186, 0.85...
## $ PW2      <dbl> 0.550, 0.556, 0.579, 0.548, 0.565, 0.569, 0.566, 0.57...
## $ PW3      <dbl> 0.297, 0.319, 0.345, 0.330, 0.339, 0.353, 0.346, 0.35...
## $ PW4      <dbl> 0.164, 0.174, 0.192, 0.202, 0.190, 0.188, 0.188, 0.22...
## $ PJI2     <dbl> 0.800, 1.000, 1.000, 0.857, 0.900, 1.000, 1.000, 1.00...
## $ BAC      <dbl> 9, 98, 5, 53, 13, 53, 56, 5, 2, 2, 7, 0, 17, 13, 15, ...
## $ Lop      <dbl> 1.460, 2.353, 0.415, 1.432, 0.945, 1.000, 1.774, 0.56...
## $ IVDE     <dbl> 1.523, 1.769, 1.681, 1.771, 1.660, 1.944, 2.100, 1.46...
## $ BIC2     <dbl> 0.528, 0.656, 0.637, 0.599, 0.683, 0.595, 0.719, 0.55...
## $ BIC5     <dbl> 0.690, 0.790, 0.838, 0.737, 0.870, 0.801, 0.824, 0.69...
## $ VEA1     <dbl> 3.748, 4.321, 4.520, 4.718, 3.763, 3.710, 4.265, 3.74...
## $ VRA1     <dbl> 145.523, 1951.851, 369.085, 503.008, 2363.843, 4235.2...
## $ piPC10   <dbl> 378.422, 885.188, 6437.338, 4063.289, 535.516, 386.62...
## $ PCR      <dbl> 6.642, 6.734, 47.580, 12.859, 13.570, 10.984, 29.614,...
## $ T.O..O.  <dbl> 0, 196, 0, 64, 4, 104, 18, 4, 0, 0, 0, 0, 4, 0, 6, 42...
## $ H3D      <dbl> 286.370, 1756.189, 1302.913, 296.152, 214.613, 202.00...
## $ G1       <dbl> 48.691, 282.760, 101.155, 117.085, 96.397, 84.028, 16...
## $ SPAM     <dbl> 0.347, 0.343, 0.314, 0.284, 0.356, 0.360, 0.346, 0.34...
## $ SPH      <dbl> 0.932, 0.955, 0.951, 0.921, 0.960, 0.961, 0.962, 1.00...
## $ FDI      <dbl> 0.700, 0.691, 0.707, 0.681, 0.712, 0.716, 0.724, 0.70...
## $ PJI3     <dbl> 0.893, 0.979, 0.912, 0.949, 0.962, 0.890, 0.861, 0.73...
## $ L.Bw     <dbl> 2.3, 3.6, 1.8, 1.5, 9.7, 12.8, 2.9, 1.1, 3.0, 18.5, 1...
## $ DISPm    <dbl> 4.464, 1.166, 7.507, 0.571, 6.409, 4.142, 10.510, 5.0...
## $ QXXm     <dbl> 42.068, 126.214, 137.359, 192.254, 69.517, 56.063, 17...
## $ DISPe    <dbl> 0.029, 0.048, 0.262, 0.006, 0.271, 0.136, 0.087, 0.29...
## $ G.N..N.  <dbl> 0.00, 5.32, 31.28, 70.34, 26.94, 3.53, 0.00, 0.00, 0....
## $ G.N..O.  <dbl> 2.69, 61.44, 43.11, 164.47, 25.52, 55.64, 17.57, 2.47...
## $ G.O..Cl. <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00,...

test <- filteredDescr[-inTrain,]
trainMDRR <- mdrrClass[inTrain]
glimpse(trainMDRR)

##  Factor w/ 2 levels "Active","Inactive": 2 1 1 1 1 2 1 2 2 1 ...

testMDRR <- mdrrClass[-inTrain]

preProcValues <- preProcess(training, method = c("center", "scale"))

trainTransformed <- predict(preProcValues, training)
glimpse(trainTransformed)

## Observations: 264
## Variables: 50
## $ AMW      <dbl> -0.81772225, 0.12340305, 0.52305900, -0.48252694, 0.0...
## $ Mp       <dbl> -0.06451537, -1.08643882, 0.61676693, -1.42707997, 0....
## $ Ms       <dbl> -0.79232630, 1.83403558, -0.16912179, 0.49859733, -0....
## $ nDB      <dbl> -0.8987343, 1.3826682, 0.2419669, -0.8987343, 1.38266...
## $ nAB      <dbl> -0.18439442, -0.18439442, 1.77259559, -0.18439442, 0....
## $ nS       <dbl> -0.395661, -0.395661, -0.395661, -0.395661, 1.978305,...
## $ nF       <dbl> -0.3015384, -0.3015384, -0.3015384, -0.3015384, -0.30...
## $ nCL      <dbl> -0.394284, -0.394284, -0.394284, -0.394284, -0.394284...
## $ nR05     <dbl> -0.5087485, -0.5087485, 1.4099600, -0.5087485, -0.508...
## $ nR06     <dbl> -0.8437012, -0.8437012, 1.0847586, 1.0847586, 1.08475...
## $ nR07     <dbl> -0.2253051, -0.2253051, -0.2253051, -0.2253051, -0.22...
## $ nR09     <dbl> -0.2795251, -0.2795251, -0.2795251, -0.2795251, -0.27...
## $ nR10     <dbl> -0.5887437, -0.5887437, 0.6068589, 0.6068589, 0.60685...
## $ nBnz     <dbl> 0.3190301, -0.9570902, 0.3190301, -2.2332105, 0.31903...
## $ HNar     <dbl> -0.08914668, -1.44022920, 1.18912300, 0.16974338, 0.4...
## $ Xt       <dbl> 0.08108121, -0.21214400, -0.21214400, -0.21214400, -0...
## $ SPI      <dbl> -0.24160233, 1.07344563, -0.24068846, -0.21580379, -0...
## $ Jhetm    <dbl> 0.8400048, 1.7521595, -0.0128048, 0.6355943, -1.24806...
## $ MAXDN    <dbl> -0.46738960, 0.78369413, 0.31725861, -0.19789352, 0.2...
## $ MAXDP    <dbl> -0.82524552, 1.18206590, 0.46790459, -0.22767049, 0.8...
## $ TIE      <dbl> -0.03294621, 3.35416218, -0.36530308, 0.31455177, -0....
## $ X5v      <dbl> -1.31049076, -0.22798030, 0.21557543, 0.84769119, 1.1...
## $ BLI      <dbl> -0.12426595, -0.35134476, -1.27301759, 0.18295833, 0....
## $ PW2      <dbl> -1.32290876, -0.88779924, 0.78012062, -1.46794527, -0...
## $ PW3      <dbl> -1.92727570, -0.80330570, 0.52502249, -0.24132069, 0....
## $ PW4      <dbl> -1.0617372, -0.4788746, 0.5702780, 1.1531406, 0.45370...
## $ PJI2     <dbl> -1.4938314, 1.0020328, 1.0020328, -0.7825102, -0.2458...
## $ BAC      <dbl> -0.5567852, 2.3465048, -0.6872701, 0.8785492, -0.4263...
## $ Lop      <dbl> 0.86665751, 2.42228597, -0.95375877, 0.81788081, -0.0...
## $ IVDE     <dbl> -1.07667588, 0.13723403, -0.29701017, 0.14710321, -0....
## $ BIC2     <dbl> -1.80525566, 0.15878810, -0.13274964, -0.71582514, 0....
## $ BIC5     <dbl> -1.89222922, -0.20559098, 0.60399538, -1.09950925, 1....
## $ VEA1     <dbl> -0.90958859, 0.28739176, 0.70309698, 1.11671323, -0.8...
## $ VRA1     <dbl> -0.06176004, -0.06125755, -0.06169785, -0.06166059, -...
## $ piPC10   <dbl> -0.5558009, -0.1876559, 3.8457562, 2.1211057, -0.4416...
## $ PCR      <dbl> -0.43405581, -0.42683501, 2.77904137, 0.05389732, 0.1...
## $ T.O..O.  <dbl> -0.4523017, 2.6010446, -0.4523017, 0.5447093, -0.3899...
## $ H3D      <dbl> -0.1808622, 0.9267856, 0.5851992, -0.1734905, -0.2349...
## $ G1       <dbl> -0.542390413, 2.109348708, 0.051967837, 0.232436862, ...
## $ SPAM     <dbl> 0.13266001, -0.02230567, -1.14580681, -2.30804937, 0....
## $ SPH      <dbl> -0.20571947, 0.49133892, 0.37011137, -0.53909522, 0.6...
## $ FDI      <dbl> -0.23914940, -0.61973813, 0.05686406, -1.04261450, 0....
## $ PJI3     <dbl> 0.09429028, 1.25711842, 0.35119418, 0.85148070, 1.027...
## $ L.Bw     <dbl> -0.7279560, -0.5245905, -0.8061735, -0.8531040, 0.429...
## $ DISPm    <dbl> -0.30910678, -1.14831993, 0.46521875, -1.29972436, 0....
## $ QXXm     <dbl> -0.65576720, 1.31960197, 1.58123638, 2.86992379, -0.0...
## $ DISPe    <dbl> -0.90404034, -0.72357048, 1.30909014, -1.12250387, 1....
## $ G.N..N.  <dbl> -0.5654677, -0.2481808, 1.3000838, 3.6296376, 1.04124...
## $ G.N..O.  <dbl> -0.653570610, 0.842680991, 0.375850491, 3.466660606, ...
## $ G.O..Cl. <dbl> -0.3001080, -0.3001080, -0.3001080, -0.3001080, -0.30...

testTransformed <- predict(preProcValues, test)

The preProcess option “ranges” scales the data to the interval [0, 1].

Imputation

preProcess can be used to impute data sets based only on information in the training set. One method of doing this is with K-nearest neighbors. For an arbitrary sample, the K closest neighbors are found in the training set and the value for the predictor is imputed using these values (e.g. using the mean). Using this approach will automatically trigger preProcess to center and scale the data, regardless of what is in the method argument.

By default, 5 nearest neighbors are being evaluated. You can change it by specifying the parameter k in preProcess

Evaluate knn imputation in several different ways:

Single Missing Numeric Value

model <- "knnImpute"
iris_miss_1 <- iris[, -5]
iris_miss_1[1,1] <- NA
head(iris_miss_1, 3)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1           NA         3.5          1.4         0.2
## 2          4.9         3.0          1.4         0.2
## 3          4.7         3.2          1.3         0.2

test1 = preProcess(iris_miss_1, method = "knnImpute")
test1Result <- predict(test1, iris_miss_1)
head(test1Result, 3)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1   -0.7824364   1.0156020    -1.335752   -1.311052
## 2   -1.1444955  -0.1315388    -1.335752   -1.311052
## 3   -1.3858682   0.3273175    -1.392399   -1.311052

Single Missing Values in All Numeric Predictors

iris_miss_2 <- iris[, -5]
iris_miss_2[1,1] <- NA
iris_miss_2[2,2] <- NA
iris_miss_2[3,3] <- NA
iris_miss_2[4,4] <- NA
head(iris_miss_2,5)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1           NA         3.5          1.4         0.2
## 2          4.9          NA          1.4         0.2
## 3          4.7         3.2           NA         0.2
## 4          4.6         3.1          1.5          NA
## 5          5.0         3.6          1.4         0.2

test2 = preProcess(iris_miss_2, method = "knnImpute")
test2Result <- predict(test2, iris_miss_2)
head(test2Result, 5)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1   -0.7824364  1.01136747    -1.349443    -1.32312
## 2   -1.1444955  0.78269713    -1.349443    -1.32312
## 3   -1.3858682  0.32535645    -1.281246    -1.32312
## 4   -1.5065546  0.09668612    -1.292612    -1.32312
## 5   -1.0238091  1.24003781    -1.349443    -1.32312

Single Missing Values in Multiple Numeric Predictors

iris_miss_3 <- iris[, -5]
iris_miss_3[1, 1:2] <- NA
head(iris_miss_3, 3)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1           NA          NA          1.4         0.2
## 2          4.9         3.0          1.4         0.2
## 3          4.7         3.2          1.3         0.2

test3 = preProcess(iris_miss_3, method = "knnImpute")
test3Result <- predict(test3, iris_miss_3)
head(test3Result, 3)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1    -1.072084   0.7930928    -1.335752   -1.311052
## 2    -1.144495  -0.1247389    -1.335752   -1.311052
## 3    -1.385868   0.3341770    -1.392399   -1.311052

Single Missing Values in All Numeric Predictors

iris_miss_4 <- iris[, -5]
iris_miss_4[1,] <- NA
test4 = preProcess(iris_miss_4, method = "knnImpute")
test4Result <- try(predict(test4, iris_miss_4), silent = TRUE)
head(test4Result)#error

## [1] "Error in FUN(newX[, i], ...) : \n  cannot impute when all predictors are missing in the new data point\n"

Single Missing Value in Categorical Predictor

iris_miss_5 <- iris
iris_miss_5[1,5] <- NA
set.seed(1)
test5Result <- train(Sepal.Length ~ . , data = iris_miss_5, method = "lm", 
           preProc = "knnImpute", na.action = na.pass)
head(test5Result)#ends in error

Each Row with at Least 1 Missing Value

iris_miss_6 <- iris[, -5]
index <- 0
for(i in 1:nrow(iris_miss_6)) {
     index <- if(index < ncol(iris_miss_6)) index + 1 else 1
     iris_miss_6[i, index] <- NA}
test6 = preProcess(iris_miss_6, method = "knnImpute")
test6Result <- try(predict(test6, iris_miss_6), silent = TRUE)
head(test6Result)

## [1] "Error in RANN::nn2(old[, non_missing_cols, drop = FALSE], new[, non_missing_cols,  : \n  Cannot find more nearest neighbours than there are points\n"

Alternatively, bagged trees can also be used to impute. For each predictor in the data, a bagged tree is created using all of the other predictors in the training set. When a new sample has a missing predictor value, the bagged model is used to predict the value. While, in theory, this is a more powerful method of imputing, the computational costs are much higher than the nearest neighbor technique.

Run the same test above replacing knnImpute with bagImpute.

Single Missing Numeric Value

iris_miss_1 <- iris[, -5]
iris_miss_1[1,1] <- NA
head(iris_miss_1, 3)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1           NA         3.5          1.4         0.2
## 2          4.9         3.0          1.4         0.2
## 3          4.7         3.2          1.3         0.2

test1 = preProcess(iris_miss_1, method = "bagImpute")
test1Result <- predict(test1, iris_miss_1)
head(test1Result, 3)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1     4.959451         3.5          1.4         0.2
## 2     4.900000         3.0          1.4         0.2
## 3     4.700000         3.2          1.3         0.2

Single Missing Values in All Numeric Predictors

iris_miss_2 <- iris[, -5]
iris_miss_2[1,1] <- NA
iris_miss_2[2,2] <- NA
iris_miss_2[3,3] <- NA
iris_miss_2[4,4] <- NA
head(iris_miss_2,5)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1           NA         3.5          1.4         0.2
## 2          4.9          NA          1.4         0.2
## 3          4.7         3.2           NA         0.2
## 4          4.6         3.1          1.5          NA
## 5          5.0         3.6          1.4         0.2

test2 = preProcess(iris_miss_2, method = "bagImpute")
test2Result <- predict(test2, iris_miss_2)
head(test2Result, 5)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1     4.928995    3.500000     1.400000   0.2000000
## 2     4.900000    3.292596     1.400000   0.2000000
## 3     4.700000    3.200000     1.459964   0.2000000
## 4     4.600000    3.100000     1.500000   0.2573922
## 5     5.000000    3.600000     1.400000   0.2000000

Single Missing Values in Multiple Numeric Predictors

iris_miss_3 <- iris[, -5]
iris_miss_3[1, 1:2] <- NA
head(iris_miss_3, 3)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1           NA          NA          1.4         0.2
## 2          4.9         3.0          1.4         0.2
## 3          4.7         3.2          1.3         0.2

test3 = preProcess(iris_miss_3, method = "bagImpute")
test3Result <- predict(test3, iris_miss_3)
head(test3Result, 3)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1     5.034189    3.297311          1.4         0.2
## 2     4.900000    3.000000          1.4         0.2
## 3     4.700000    3.200000          1.3         0.2

Single Missing Values in All Numeric Predictors

iris_miss_4 <- iris[, -5]
iris_miss_4[1,] <- NA
set.seed(1)
test4 = preProcess(iris_miss_4, method = "bagImpute")
test4Result <- try(predict(test4, iris_miss_4), silent = TRUE)
head(test4Result)#error

## [1] "Error in cprob[tindx] <- cprob[tindx] + pred : \n  replacement has length zero\n"

Single Missing Value in Categorical Predictor

iris_miss_5 <- iris
iris_miss_5[1,5] <- NA
set.seed(1)
test5Result <- train(Sepal.Length ~ . , data = iris_miss_5, method = "lm", 
           preProc = "bagImpute", na.action = na.pass)
head(test5Result)#ends in error

Each Row with at Least 1 Missing Value

iris_miss_6 <- iris[, -5]
index <- 0
for(i in 1:nrow(iris_miss_6)) {
     index <- if(index < ncol(iris_miss_6)) index + 1 else 1
     iris_miss_6[i, index] <- NA}

test6 = preProcess(iris_miss_6, method = "bagImpute")
test6Result <- try(predict(test6, iris_miss_6), silent = TRUE)
head(test6Result)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1          NaN         3.5          1.4         0.2
## 2          4.9         NaN          1.4         0.2
## 3          4.7         3.2          NaN         0.2
## 4          4.6         3.1          1.5         NaN
## 5          NaN         3.6          1.4         0.2
## 6          5.4         NaN          1.7         0.4

Transforming Predictors

In some cases there is a need to use principal component analysis (PCA) to transform the data to a smaller sub–space where the new variables are uncorrelated with one another. The preProcess class can apply this transformation by including pca in the method argument. Doing this will also force scaling of the predictors. Note that when PCA is requested, predict.preProcess changes the column names to PC1, PC2 and so on.

In preprocess thresh is a cutoff for the cumulative percent of variance to be retained by PCA

trans = preProcess(iris[,1:4], method=c("BoxCox", "center", "scale", "pca"))
PC = predict(trans, iris[,1:4])
head(PC, 5)

##         PC1        PC2
## 1 -2.303540 -0.4748260
## 2 -2.151310  0.6482903
## 3 -2.461341  0.3463921
## 4 -2.413968  0.6066839
## 5 -2.432777 -0.6110057

A second more thorough example on how to select a specific number of PCA components. Set pcaComp = 7 inside preProcess or use thresh = 0.8 and then apply your processing to the training and testing data. Remember, if you have categorical variables you must convert them first to dummy variables before you can apply your processing (center, scale, pca, etc.).

data(Boston)#MASS
train_idx <- createDataPartition(Boston$medv, p = 0.75, list = FALSE)

train <- Boston[train_idx,]
test <- Boston[-train_idx,]

#Now using preProcess, you need to set the pcaComp = 7, or thresh = 0.8
#you may need to center and scale first and then apply PCA or just use method = c("pca")
#create the preProc object, remember to exclude the response (medv)
preProc  <- preProcess(train[,-14], method = c("center", "scale", "pca"), pcaComp = 7) # or thresh = 0.8
#Apply the processing to the train and test data, and add the response to the dataframes
train_pca <- predict(preProc, train[,-14])
train_pca$medv <- train$medv
test_pca <- predict(preProc, test[,-14])
test_pca$medv <- test$medv

#you can verify the 7 comp
head(train_pca)

##          PC1         PC2         PC3        PC4        PC5        PC6
## 1 -2.0960761  0.74230541  0.33322081  0.1966248 0.96824476  0.3187772
## 2 -1.4678653  0.58636579 -0.69037414 -0.2639753 0.43774470 -0.2901575
## 3 -2.0915688  0.54600322  0.06372609 -1.0341427 0.61986491 -0.5193874
## 4 -2.6230664 -0.04424800 -0.22192804 -1.0670302 0.17916682 -0.7606018
## 6 -2.2210568 -0.01425377 -0.72160739 -0.5631681 0.05588535 -0.6396086
## 8 -0.8560468  0.59910176 -0.38783493  1.1453370 0.74546594  0.7006312
##          PC7 medv
## 1 -0.3931796 24.0
## 2 -0.5611659 21.6
## 3 -0.4445695 34.7
## 4 -0.5267303 33.4
## 6 -0.5893609 28.7
## 8 -0.3187991 27.1

#Now fit your lm model
fit <- lm(medv~., data = train_pca)
fit$coefficients

## (Intercept)         PC1         PC2         PC3         PC4         PC5 
##  22.6081365  -2.3186566   2.0043909   3.5224221  -2.7141782   0.6091648 
##         PC6         PC7 
##   0.3469728   0.4627322

Similarly, independent component analysis (ICA) can also be used to find new variables that are linear combinations of the original set such that the components are independent (as opposed to uncorrelated in PCA). The new variables will be labeled as IC1, IC2 and so on. (I do not have an example of this - I have never used this feature.)

The “spatial sign” transformation projects the data for a predictor to the unit circle in p dimensions, where p is the number of predictors. Essentially, a vector of data is divided by its norm. The two figures below show two centered and scaled descriptors from the MDRR data before and after the spatial sign transformation. The predictors should be centered and scaled before applying this transformation.

data(mdrr)
transparentTheme(trans = .4)
plotSubset <- data.frame(scale(mdrrDescr[, c("nC", "X4v")]))
xyplot(nC ~ X4v, data = plotSubset, groups = mdrrClass, auto.key = list(columns = 2))

After the spatial sign:

transformed <- spatialSign(plotSubset)
transformed <- as.data.frame(transformed)
xyplot(nC ~ X4v, data = transformed, groups = mdrrClass, auto.key = list(columns = 2))

Another option, “BoxCox” will estimate a Box–Cox transformation on the predictors if the data are greater than zero. (See the Appendix for a more thorough example.)

data(mdrr)
nzv <- nearZeroVar(mdrrDescr)
filteredDescr <- mdrrDescr[, -nzv]
inTrain <- sample(seq(along = mdrrClass), length(mdrrClass)/2)
training <- filteredDescr[inTrain,]
test <- filteredDescr[-inTrain,]
preProcValues2 <- preProcess(training, method = "BoxCox")
trainBC <- predict(preProcValues2, training)
testBC <- predict(preProcValues2, test)
preProcValues2

## Created from 264 samples and 258 variables
## 
## Pre-processing:
##   - Box-Cox transformation (258)
##   - ignored (0)
## 
## Lambda estimates for Box-Cox transformation:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -2.0000  0.1000  0.5000  0.4926  0.9000  2.0000

The NA values correspond to the predictors that could not be transformed. This transformation requires the data to be greater than zero.

Class Distance Calculations

caret contains functions to generate new predictors variables based on distances to class centroids (similar to how linear discriminant analysis works). For each level of a factor variable, the class centroid and covariance matrix is calculated. For new samples, the Mahalanobis distance to each of the class centroids is computed and can be used as an additional predictor. This can be helpful for non-linear models when the true decision boundary is actually linear.

In cases where there are more predictors within a class than samples, the classDist function has arguments called pca and keep arguments that allow for principal components analysis within each class to be used to avoid issues with singular covariance matrices.

predict.classDist is then used to generate the class distances. By default, the distances are logged but this can be changed via the trans argument to predict.classDist.

trainSet <- sample(1:150, 100)
distData <- classDist(iris[trainSet, 1:4], iris$Species[trainSet])
newDist <- predict(distData, iris[-trainSet, 1:4])
splom(newDist, groups = iris$Species[-trainSet])

Appendix

Inadequate data pre-processing is one of the common reasons on why some predictive models fail.

How your predictors are encoded can have a significant impact on model performance. For example, the ratio of two predictors may be more effective than using two independent predictors. This will depend on the model used as well as on the particularities of the phenomenon you want to predict. The manufacturing of predictors to improve prediction performance is called feature engineering. To succeed at this stage you should have a deep understanding of the problem you are trying to model.

A good practice is to center, scale and apply skewness transformations for each of the individual predictors. This practice gives more stability for numerical algorithms used later in the fitting of different models as well as improve the predictive ability of some models. Box and Cox transformation, centering and scaling can be applied using preProcess from caret.

require(gridExtra)
require(reshape2)

predictors = data.frame(x1 = rnorm(1000, mean = 5, sd = 2), x2 = rexp(1000, rate=10))
p1 = ggplot(predictors) + geom_point(aes(x = x1, y = x2))

temp <- melt(predictors, measured = c("x1", "x2"))
p2 = ggplot(temp) + geom_histogram(aes(x=value)) + facet_grid(. ~ variable, scales = "free_x")

grid.arrange(p1, p2)

trans <- preProcess(predictors, c("BoxCox", "center", "scale"))
predictors_trans <- data.frame(trans = predict(trans, predictors))

p1 = ggplot(predictors_trans) + geom_point(aes(x = trans.x1, y = trans.x2))
temp <- melt(predictors_trans, measured = c("trans.x1", "trans.x2"))

p2 = ggplot(temp) + geom_histogram(aes(x=value), data = temp) + 
  facet_grid(. ~ variable, scales = "free_x")

grid.arrange(p1, p2)

References

This document started as a near-copy of this well-known and often-imitated site: http://topepo.github.io/caret/pre-processing.html. I have added and changed the content over time as my expertise in preprocessing matures. (It is a never-ending process!)