Reminder: Feature extraction transforms the existing features into a lower-dimensional space. Feature selection selects a subset of the existing features without any transformation.
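
To make the distinction concrete, here is a minimal sketch in R using the built-in iris data (the object names are arbitrary): prcomp performs feature extraction by projecting the original columns onto new components, while plain column subsetting is feature selection.

# feature extraction: PCA transforms the original columns into new components
pca <- prcomp(iris[, 1:4], scale. = TRUE)
extracted <- pca$x[, 1:2]  # two new, transformed features
# feature selection: keep a subset of the original columns, untransformed
selected <- iris[, c("Sepal.Length", "Petal.Length")]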

Introduction

Selecting the right features in your data can mean the difference between mediocre performance with long training times and great performance with short training times.

The caret R package provides tools to automatically report on the relevance and importance of attributes in your data, and can even select the most important features for you.

Let's discover the feature selection tools in the caret R package.

Remove Redundant Features

Data can contain attributes that are highly correlated with each other. Many methods perform better if highly correlated attributes are removed.

The caret R package provides the findCorrelation function, which analyzes a correlation matrix of your data's attributes and reports on attributes that can be removed.

The following example loads the Pima Indians Diabetes dataset, which contains a number of biological attributes from medical reports. A correlation matrix is created from these attributes and highly correlated attributes are identified; in this case the age attribute is removed because it correlates highly with the pregnant attribute.

Generally, you want to remove attributes with an absolute correlation of 0.75 or higher; the example below uses a more aggressive cutoff of 0.5.

# load the packages (mlbench provides the dataset, caret provides findCorrelation)
library(mlbench)
library(caret)
library(corrplot)  # used below to visualize the correlation matrix
# ensure the results are repeatable
set.seed(7)
# load the data
data(PimaIndiansDiabetes)
# calculate the correlation matrix of the 8 input attributes
correlationMatrix <- cor(PimaIndiansDiabetes[,1:8])
# summarize the correlation matrix
print(correlationMatrix)
##             pregnant    glucose   pressure     triceps     insulin
## pregnant  1.00000000 0.12945867 0.14128198 -0.08167177 -0.07353461
## glucose   0.12945867 1.00000000 0.15258959  0.05732789  0.33135711
## pressure  0.14128198 0.15258959 1.00000000  0.20737054  0.08893338
## triceps  -0.08167177 0.05732789 0.20737054  1.00000000  0.43678257
## insulin  -0.07353461 0.33135711 0.08893338  0.43678257  1.00000000
## mass      0.01768309 0.22107107 0.28180529  0.39257320  0.19785906
## pedigree -0.03352267 0.13733730 0.04126495  0.18392757  0.18507093
## age       0.54434123 0.26351432 0.23952795 -0.11397026 -0.04216295
##                mass    pedigree         age
## pregnant 0.01768309 -0.03352267  0.54434123
## glucose  0.22107107  0.13733730  0.26351432
## pressure 0.28180529  0.04126495  0.23952795
## triceps  0.39257320  0.18392757 -0.11397026
## insulin  0.19785906  0.18507093 -0.04216295
## mass     1.00000000  0.14064695  0.03624187
## pedigree 0.14064695  1.00000000  0.03356131
## age      0.03624187  0.03356131  1.00000000
# find attributes that are highly correlated (using a cutoff of 0.5 here)
highlyCorrelated <- findCorrelation(correlationMatrix, cutoff=0.5)
# print indexes of highly correlated attributes
print(highlyCorrelated)
## [1] 8
# visualize the correlation matrix with the handy corrplot package
corrplot(correlationMatrix, method="circle")

# mixed plot: correlation values in one triangle, circles in the other
corrplot.mixed(correlationMatrix)
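
findCorrelation returns column indexes, so the flagged attributes can be dropped before modeling. A minimal sketch, assuming the objects from the example above are still in the workspace:

# drop the highly correlated attribute (here column 8, age) before modeling
reducedData <- PimaIndiansDiabetes[, -highlyCorrelated]
# age is gone; the remaining attributes and the class column are kept
names(reducedData)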

Rank Features By Importance

The importance of features can be estimated from the data by building a model. Some methods, such as decision trees, have a built-in mechanism for reporting variable importance. For other algorithms, the importance can be estimated using a ROC curve analysis conducted for each attribute.
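
caret can run this per-attribute ROC analysis directly, without training a model first, via its filterVarImp function. A minimal sketch, assuming mlbench and caret are loaded and the dataset is attached as in the earlier example:

# model-free importance: area under the ROC curve, computed per attribute
rocImportance <- filterVarImp(x=PimaIndiansDiabetes[,1:8], y=PimaIndiansDiabetes$diabetes)
print(rocImportance)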

The example below loads the Pima Indians Diabetes dataset and constructs a Learning Vector Quantization (LVQ) model. The varImp function is then used to estimate the variable importance, which is printed and plotted. It shows that glucose, mass and age are the top 3 most important attributes in the dataset and that insulin is the least important.

# ensure the results are repeatable
set.seed(7)
# load the libraries and data (if not already loaded)
library(mlbench); library(caret); data(PimaIndiansDiabetes)
# prepare the training scheme: 10-fold cross-validation repeated 3 times
control <- trainControl(method="repeatedcv", number=10, repeats=3)
# train the LVQ model
model <- train(diabetes~., data=PimaIndiansDiabetes, method="lvq", preProcess="scale", trControl=control)
# estimate variable importance
importance <- varImp(model, scale=FALSE)
# summarize importance
print(importance)
## ROC curve variable importance
## 
##          Importance
## glucose      0.7881
## mass         0.6876
## age          0.6869
## pregnant     0.6195
## pedigree     0.6062
## pressure     0.5865
## triceps      0.5536
## insulin      0.5379
# plot importance
plot(importance)

Feature Selection

Automatic feature selection methods can be used to build many models with different subsets of a dataset and identify those attributes that are and are not required to build an accurate model.

A popular automatic method for feature selection provided by the caret R package is called Recursive Feature Elimination or RFE.

The example below demonstrates the RFE method on the Pima Indians Diabetes dataset. A Random Forest algorithm is used on each iteration to evaluate the model. The algorithm is configured to explore every candidate subset size from 1 to 8 attributes. All 8 attributes are selected in this example, although the plot of accuracy against attribute subset size shows that just 4 attributes give almost comparable results.

# ensure the results are repeatable
set.seed(7)
# load the libraries and data (if not already loaded)
library(mlbench); library(caret); data(PimaIndiansDiabetes)
# define the control using a random forest selection function
control <- rfeControl(functions=rfFuncs, method="cv", number=10)
# run the RFE algorithm over subset sizes 1 through 8
results <- rfe(PimaIndiansDiabetes[,1:8], PimaIndiansDiabetes[,9], sizes=c(1:8), rfeControl=control)
# summarize the results
print(results)
## 
## Recursive feature selection
## 
## Outer resampling method: Cross-Validated (10 fold) 
## 
## Resampling performance over subset size:
## 
##  Variables Accuracy  Kappa AccuracySD KappaSD Selected
##          1   0.7096 0.3006    0.04998 0.11943         
##          2   0.7305 0.3828    0.06076 0.14849         
##          3   0.7475 0.4295    0.04662 0.10923         
##          4   0.7645 0.4688    0.05772 0.12635         
##          5   0.7578 0.4548    0.04052 0.08816         
##          6   0.7553 0.4446    0.05478 0.12738         
##          7   0.7514 0.4385    0.04465 0.10356         
##          8   0.7708 0.4808    0.04824 0.10821        *
## 
## The top 5 variables (out of 8):
##    glucose, mass, age, pregnant, pedigree
# list the chosen features
predictors(results)
## [1] "glucose"  "mass"     "age"      "pregnant" "pedigree" "insulin" 
## [7] "triceps"  "pressure"
# plot the results
plot(results, type=c("g", "o"))
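
Since the plot suggests that 4 attributes perform almost as well as all 8, one option is to keep only the top-ranked variables. A minimal sketch, assuming the results object from the example above (the cutoff of 4 is read off the plot, not computed automatically):

# keep the 4 highest-ranked attributes plus the class column
top4 <- predictors(results)[1:4]
selectedData <- PimaIndiansDiabetes[, c(top4, "diabetes")]
names(selectedData)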

You discovered three feature selection methods provided by the caret R package: searching for and removing redundant features, ranking features by importance, and automatically selecting a subset of the most predictive features.