Feature Subset Selection in R Using Wrappers

Wrapper methods use a search algorithm to explore candidate subsets of features and measure the accuracy of each subset against a chosen learning algorithm.

Feature selection techniques that use a ranking method measure the statistical dependence between each individual feature and the class. Filter methods go a step further: they measure statistical dependence between whole subsets and the class, as well as the correlation among the features inside a subset.

Filters vs Wrappers

The wrapper approach has two parts. A search component builds candidate subsets and feeds them into an evaluator that wraps the learning algorithm; the evaluator scores each subset by training and testing the enclosed learning algorithm. When the search finishes iterating over its candidate subsets, it returns the highest-scoring one.

Recreated from: Comparison between a filter and a wrapper approach to variable subset selection in regression problems. (see resources)
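For contrast, the filter side needs no wrapping at all: FSelector can rank features directly against the target. A minimal sketch, assuming the oldProducts data frame and the Volume target loaded later in this post, with all columns numeric:

# Filter-style selection for contrast: no learning algorithm in the loop
# Assumes oldProducts has already been read in as shown below
library(FSelector)
weights <- linear.correlation(Volume ~ ., oldProducts)  # rank each feature individually
print(weights)                      # per-feature importance scores
print(cutoff.k(weights, 5))         # keep the five top-ranked features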

Wrappers in Action

Using the feature subset selection wrappers in the R FSelector package requires us to build an evaluator function that wraps the learning algorithm. Here we use the rpart learning algorithm with several different feature-subset searches. There is an example function in the FSelector vignette that we can modify.

The Evaluator

This user-defined R evaluator function uses the rpart learning algorithm. rpart implements recursive partitioning for classification, regression, and survival trees, as described in its vignette. Since we are doing regression, we transform the character feature in our data set to numeric. In the following files the product type is already numeric; they can be downloaded here as: existing-product-wrappers.csv, and new-products-wrappers.csv.
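If you start from the original CSVs instead of the pre-converted files, one simple way to make that conversion yourself (assuming the character column is named ProductType, as in the results shown later) is:

# Convert the character product-type column to a numeric code
# (column name ProductType assumed, matching the files above)
oldProducts$ProductType <- as.numeric(factor(oldProducts$ProductType))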

Before the feature selection search algorithm can use this evaluator, the function must be run inside R so that it is stored in the current workspace.

# Evaluator function for FSelector wrappers
evaluator <- function(subset) {          # scorer function
  k <- 5                                 # k-fold cross validation
  splits <- runif(nrow(oldProducts))
  results <- sapply(1:k, function(i) {   # iterate over the folds
    test.idx  <- (splits >= (i - 1) / k) &
                 (splits < i / k)        # assign rows to test
    train.idx <- !test.idx               # remainder to train
    test  <- oldProducts[test.idx,  , drop = FALSE]
    train <- oldProducts[train.idx, , drop = FALSE]
    tree <- rpart(as.simple.formula(subset, "Volume"),  # learning algorithm
                  data = train)
    rmse <- sqrt(mean((test$Volume -     # score: regression error on the fold
                         predict(tree, test))^2))
    return(-rmse)                        # the search maximises, so negate the error
  })
  return(mean(results))                  # mean score returned to the search algorithm
}
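Once the evaluator is defined, you can sanity-check it by scoring one hand-picked subset before handing it to a search:

# Quick sanity check: score a single hand-picked subset
evaluator(c("StarReviews5", "PosServiceReview"))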

Running the function definition stores it in the current session and makes it available to be called. You can also save the function to a file and load it by using:

# load your function from a file 
source("YourNamedFunctionFile.R")

Hill Climbing Search for Feature Subset Selection

Hill climbing starts by choosing a random feature subset. It then evaluates neighbouring subsets, each differing by one feature, and moves to a neighbour whenever that decreases the error or increases the accuracy, stopping when no neighbour does. The problem with this approach is that it can end up on a foothill, a local maximum, rather than the global maximum.
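FSelector's hill.climbing.search() does this work for us, but the idea can be sketched in a few lines (a schematic only, not the package's internals):

# Schematic hill climbing over feature subsets (illustration only)
hill_climb <- function(features, evaluator) {
  current <- sample(features, sample(seq_along(features), 1))  # random start
  best    <- evaluator(current)
  repeat {
    # neighbours differ from the current subset by one feature added or removed
    neighbours <- lapply(features, function(f)
      if (f %in% current) setdiff(current, f) else c(current, f))
    neighbours <- Filter(length, neighbours)     # drop the empty subset
    scores <- sapply(neighbours, evaluator)
    if (max(scores) <= best) return(current)     # no improvement: a (possibly local) maximum
    best    <- max(scores)
    current <- neighbours[[which.max(scores)]]
  }
}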

# Hill climbing search wraps the rpart learning algorithm  
library(FSelector)                  # Provides search algorithm
library(rpart)                      # Provides learning algorithm
setwd("/Users/xyz/rData") 
oldProducts <- read.csv('existing_products_wrappers.csv' 
                      , sep="," 
                      , header = T 
                      , as.is=T 
                      , stringsAsFactors=F 
                      , check.names = FALSE) 
set.seed(444)                       # Allows replication 
subset <- hill.climbing.search(names(oldProducts)[-16]
               , evaluator)
f <-  as.simple.formula(subset, "Volume")
print(f)
# Feature subset selection
# ProductType + StarReviews5 + StarReviews4 + StarReviews2 + 
#    StarReviews1 + PosServiceReview + WouldRecommend + ShippingWeight + 
#      ProductDepth + ProductHeight

The results include features such as the product's depth and height. What domain knowledge do we have? Do we ourselves decide whether to buy a new product based on its shipping weight? Although it is not mentioned in these results, ProductID is similarly unlikely to help with predictions. We decide to remove those columns from our data set and try again.

# Hill climbing search wraps the rpart learning algorithm  
library(FSelector)                  # Provides search algorithm
library(rpart)                      # Provides learning algorithm
setwd("/Users/xyz/rData") 
oldProducts <- read.csv('existing_products_wrappers.csv' 
                      , sep="," 
                      , header = T 
                      , as.is=T 
                      , stringsAsFactors=F 
                      , check.names = FALSE) 
# remove columns from data
oldProducts <- subset(oldProducts, select = -c(ProductID
                                             , ShippingWeight
                                             , ProductDepth
                                             , ProductWidth
                                             , ProductHeight
                                             , Profitmargin))
set.seed(444)                       # Allows replication 
subset <- hill.climbing.search(names(oldProducts)[-12]   # Volume is column 12 after the removals
               , evaluator)
f <-  as.simple.formula(subset, "Volume")
print(f)
# Feature subset selection
#    StarReviews5 + StarReviews2 + StarReviews1 + PosServiceReview + 
#      NegServiceReview + WouldRecommend + BestSellersRank

Backward Search Selection

Backward search starts with all features and removes them one by one. At each step it removes the feature whose removal improves the evaluator's score the most, and it stops when no further removal improves the score. We will use the same data as above, with the features describing physical dimensions, such as shipping weight, depth, and width, removed.
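FSelector's backward.search() handles the bookkeeping, but the elimination loop itself is simple (again, a schematic only):

# Schematic backward elimination over feature subsets (illustration only)
backward_eliminate <- function(features, evaluator) {
  current <- features
  best    <- evaluator(current)
  while (length(current) > 1) {
    # score every subset obtained by dropping a single feature
    candidates <- lapply(current, function(f) setdiff(current, f))
    scores     <- sapply(candidates, evaluator)
    if (max(scores) <= best) break              # no removal improves the score
    best    <- max(scores)
    current <- candidates[[which.max(scores)]]
  }
  current
}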

# Backward search wraps the rpart learning algorithm  
library(FSelector)                  # Provides search algorithm
library(rpart)                      # Provides learning algorithm
setwd("/Users/xyz/rData") 
oldProducts <- read.csv('existing_products_wrappers.csv' 
                      , sep="," 
                      , header = T 
                      , as.is=T 
                      , stringsAsFactors=F 
                      , check.names = FALSE) 

# remove columns from data
oldProducts <- subset(oldProducts, select = -c(ProductID
                                             , ShippingWeight
                                             , ProductDepth
                                             , ProductWidth
                                             , ProductHeight
                                             , Profitmargin))
set.seed(123); system.time(subset  <-  
                             backward.search(names(oldProducts)[-12]
                          ,  evaluator))
f <-  as.simple.formula(subset, "Volume")
print(f)
#  Feature subset selection
#  ProductType + Price + StarReviews5 + StarReviews4 + 
#    StarReviews3 + StarReviews2 + StarReviews1 + PosServiceReview + 
#    NegServiceReview + WouldRecommend + BestSellersRank

All Relevant Features

In some cases redundant features still carry valuable information. In bioinformatics, for example, finding every gene that acts in a biological process, however small its role, can matter more than classification accuracy. Boruta is an all-relevant feature selection algorithm: it wraps a random forest learning algorithm and keeps every relevant variable, no matter how small its contribution.

# Boruta wraps a random forest learning algorithm
library(Boruta)                     # Provides the all-relevant wrapper
setwd("/Users/xyz/rData") 
oldProducts <- read.csv('existing_products_wrappers.csv' 
                      , sep="," 
                      , header = T 
                      , as.is=T 
                      , stringsAsFactors=F 
                      , check.names = FALSE) 
set.seed(123); system.time(
  boruta_output <- Boruta(Volume~.
                          , oldProducts
                          , doTrace = 1))
plot(boruta_output, las=2, cex.axis=.7
                  , xlab=""
                  , sort=TRUE
                  , main="Variable Importance") 
legend(x=1, y=12, legend = c("Confirmed"
                             , "Tentative"
                             , "Rejected")
                             , fill=c(3,7,2)) 
borutaConfirmed <- names(boruta_output$finalDecision[
                          boruta_output$finalDecision %in%
                            c("Confirmed")])
print(borutaConfirmed)
#  Confirmed features:
#  StarReviews5, StarReviews4, StarReviews3, StarReviews2,
#    StarReviews1, PosServiceReview, NegServiceReview
Boruta Box Plot Showing Features Evaluated
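Boruta also reports per-feature statistics and can resolve features it left as tentative. A short follow-up on the boruta_output object from above:

# Inspect importance statistics and force a decision on tentative features
print(attStats(boruta_output))              # importance summary and decision per feature
boruta_final <- TentativeRoughFix(boruta_output)
print(getSelectedAttributes(boruta_final))  # confirmed features after the rough fix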

Resources

Kojadinovic and Wattka. Comparison between a filter and a wrapper approach to variable subset selection in regression problems. Research Gate. 1999.
Jantawan and Tsai. Approaches with Data Mining Techniques for Categorical Variables Selection. International Journal of Innovative Research in Computer and Communication Engineering. V2, I6, 2014.
Kursa and Rudnicki. Feature Selection with the Boruta Package. Journal of Statistical Software. 2010.
