Feature Subset Selection in R using Filters
Feature subset selection (FSS) using a filter is the process of finding the best combination of attributes in the available data to produce the highest prediction accuracy.
This post extends the previous post, Feature Selection in R using Ranking, and precedes Feature Selection in R using Wrappers. Ranking, as we saw, is a univariate method: it considers each feature individually. Filtering is a multivariate method: it considers subsets of features. Ranking and filtering are the simplest, fastest, and least complex forms of feature selection.
Here, as with ranking, we assume a business has asked us to predict sales volume for a list of new products. They provided a list of existing products; each existing product has a sales volume and 17 features. The new products to be predicted have the same 17 features. If you don't have the data from the last post you can download it here.
The first question is: do we need all the features?

Seventeen features is a small number compared to gene expression data or MRI brain scans, where there can be thousands of features or more. However, our simple example helps us to get started. We will also see that even with very small data sets, feature selection can produce significant gains in prediction accuracy.
Feature Subset Selection Filtering
Filter algorithms expose relationships between features as well as correlation to the class (in this case sales volume).
Wrapper methods are a third approach to feature subset selection. As shown in the next post, they are more costly than either ranking or filters: a learning algorithm is wrapped inside a feature subset selection algorithm, and each candidate subset is tested against the learning algorithm and scored.
Since filter methods don't use a prediction model they produce more general results, but there is no guarantee that the subset found will be optimal with respect to the representational bias of a learning model chosen at a later date.
Both ranking and filters are considered preprocessing steps, stepping stones to understanding the underlying data. Both perform feature selection quickly with very low overhead, and their results can be compared.

There are a number of common filter approaches: information gain, gain ratio, symmetric uncertainty, linear correlation, Chi-square, Fisher score, mRMR and CFS. [1] Here we will look at the first three, which depend on information gain and are collectively referred to as entropy based filter methods. We finish by looking at a fourth algorithm, linear correlation.
What is Entropy?
Entropy is “… relative to what you know and not a property of the system itself.” [2]
Or, in this case, what we know about the data. In the context of feature selection, entropy is a measurement based on how often each value in a feature occurs, for example how many products had a value of 10 for the feature 5 Star Ratings.
Entropy for all the data is calculated first. Features are then grouped together (partitioned) and their entropy is calculated and subtracted from the original entropy. This subtraction gives information gain. Whichever subset results in the biggest information gain is the one chosen.
There is an excellent video [3] which uses a simple example to illustrate how to calculate entropy manually using Shannon’s famous entropy definition H.
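To make that concrete, here is a minimal sketch of Shannon’s H and of information gain computed by hand. The class and feature values below are made up for illustration; they are not from the product data.
# toy example: Shannon entropy H and information gain by hand
class.toy <- c("High", "High", "Low", "Low", "Low", "High") # made-up sales class
feat.toy  <- c("A", "A", "B", "B", "A", "B")                # made-up feature values
entropyH <- function(x) {
  p <- table(x) / length(x)  # how often each value occurs
  -sum(p * log2(p))          # Shannon's H in bits
}
h.all <- entropyH(class.toy) # entropy before partitioning
# entropy of the class within each partition of the feature,
# weighted by the size of each partition
h.partitioned <- sum(sapply(split(class.toy, feat.toy), entropyH)
                     * table(feat.toy) / length(feat.toy))
h.all - h.partitioned        # the difference is the information gain
The FSelector functions below apply the same idea to every feature at once.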
Entropy Based Feature Selection in R
Here we use the FSelector package in R to do the math for us. The lower a subset’s entropy (H value), the higher the information gain and the more accurate the predictions.
Three FSelector entropy based algorithms considered here are: Information Gain, Gain Ratio, and Symmetric Uncertainty.
Information Gain
Information gain is the reduction in entropy H. It is calculated in two steps: first calculate the entropy for the entire data set, then subtract the entropy remaining after partitioning on each feature, and later on subsets of features. It is also known as expected mutual information. The larger the information gain, the better the prediction.
$$Information\ Gain = Total\ starting\ entropy\ -\ Entropy\ of\ this\ feature$$
$$Information\ Gain = H(class) + H(feature)\ -\ H(class, feature)$$
The calculation uses the frequency of values in each feature, so it is biased towards features which have a large number of different values. This may lead to overfitting.
For example, each row in our products list has a different ProductID. This gives a high information gain but does not generalize well: a new product will never have the same ProductID as an existing one. So in the code below we remove the ProductID before we begin.
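Before running the real analysis, here is a quick synthetic check of that bias (toy data, not the product file): an ID column with a unique value per row scores the maximum possible information gain even though it is useless for prediction.
# toy data: a unique ID per row versus a weakly informative feature
library(FSelector)
set.seed(42)
toy <- data.frame(ID = factor(1:20)
                  , Score = factor(sample(c("Low", "High"), 20, replace = TRUE))
                  , Class = factor(sample(c("Low", "High"), 20, replace = TRUE)))
print(information.gain(Class~., toy, unit = "log2"))
# ID gets the largest weight because it identifies every row exactly,
# which is why ProductID is removed in the code below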
# entropy-based: information gain
library(FSelector) # provides info gain, others
setwd("/Users/xyz/rData")
oldProducts <- read.csv('existing-product-attributes.csv'
, sep=","
, header = T
, as.is=T
, stringsAsFactors=F
, check.names = FALSE)
oldProducts <- oldProducts[-1] # remove ProductID
weights.ig <- information.gain(Volume~., oldProducts
, unit ="log2")
subset.ig <- cutoff.k(weights.ig, 5)
results.ig <- as.simple.formula(subset.ig, "Volume")
print(results.ig)
# Best feature subset ...
# StarReviews5 + PosServiceReview + StarReviews4 +
# StarReviews3 + ProductType
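The choice of k = 5 in cutoff.k is arbitrary. If you would rather not fix the subset size in advance, FSelector also provides cutoff.biggest.diff and cutoff.k.percent; a small variation, reusing weights.ig from the block above:
# alternative cutoffs, reusing weights.ig from above
subset.ig.diff <- cutoff.biggest.diff(weights.ig)   # cut at the largest drop in weight
subset.ig.pct  <- cutoff.k.percent(weights.ig, 0.3) # keep the top 30% of features
print(as.simple.formula(subset.ig.diff, "Volume"))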
Gain Ratio
Gain Ratio reduces the bias of a particular feature’s information gain by dividing the result by the entropy of that feature. Here this is done for each feature and then for subsets of features. The result of gain ratio is the set of features with the highest gain ratio (lowest bias).
$$Gain\ Ratio = \frac{H(class) + H(feature)\ -\ H(class, feature)}{H(feature)}$$
# entropy-based: gain ratio
library(FSelector) # provides info gain, others
setwd("/Users/xyz/rData")
oldProducts <- read.csv('existing-product-attributes.csv'
, sep=","
, header = T
, as.is=T
, stringsAsFactors=F
, check.names = FALSE)
oldProducts <- oldProducts[-1] # remove ProductID
weights.gr <- gain.ratio(Volume~., oldProducts
             , unit ="log2") # default is log e
subset.gr <- cutoff.k(weights.gr, 5) # keep the top 5 features
results.gr <- as.simple.formula(subset.gr, "Volume")
print(results.gr)
# Best feature subset ...
# StarReviews5 + PosServiceReview + StarReviews4 +
# StarReviews2 + StarReviews1
Symmetric Uncertainty
Symmetric uncertainty is another entropy based algorithm. It finds features that correlate well with the class but not with each other. Correlation between two features is measured using symmetric uncertainty, with values between zero and one. As an example, start with the idea of a predominant feature and test a second feature against it for symmetric uncertainty. If the second feature is as correlated to the predominant feature as it is to the class you get a value of 1, meaning the second feature is redundant with the predominant feature.
$$Symmetric\ Uncertainty = \frac{2\,\left[H(class) + H(feature) - H(class, feature)\right]}{H(feature) + H(class)}$$
# entropy-based: symmetrical uncertainty
library(FSelector) # provides info gain, others
setwd("/Users/xyz/rData")
oldProducts <- read.csv('existing-product-attributes.csv'
, sep=","
, header = T
, as.is=T
, stringsAsFactors=F
, check.names = FALSE)
oldProducts <- oldProducts[-1] # remove ProductID
weights.su <- symmetrical.uncertainty(Volume~., oldProducts
             , unit ="log2")
subset.su <- cutoff.k(weights.su, 5) # keep the top 5 features
results.su <- as.simple.formula(subset.su, "Volume")
print(results.su)
# Best feature subset ...
# StarReviews5 + PosServiceReview + StarReviews4 +
# StarReviews3 + StarReviews2
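Assuming oldProducts is still loaded from the block above, we can also measure two of the selected features directly against each other (the column names here are taken from the results above). A value close to 1 would mean StarReviews4 is largely redundant given StarReviews5.
# pairwise redundancy check between two selected features
print(symmetrical.uncertainty(StarReviews4~StarReviews5, oldProducts
                              , unit ="log2"))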
Linear Correlation
This measures the linear relationship between two features. If the correlation is low there is no tendency for the features to increase or decrease together. Two uncorrelated features may still have a nonlinear relationship, however, and are therefore not necessarily independent.
Since this algorithm uses Pearson’s linear correlation coefficient it will not work on non-continuous features (i.e. ProductType in our example), so we first need to remove that column from our data.
# Linear correlation
library(FSelector) # provides linear.correlation, others
setwd("/Users/xyz/rData")
oldProducts <- read.csv('existing-product-attributes.csv'
, sep=","
, header = T
, as.is=T
, stringsAsFactors=F
, check.names = FALSE)
weightsLinearCorr <- linear.correlation(Volume~.
, oldProducts)
# Error in FUN(X[[i]], ...) : All data must be continous.
oldProducts <- oldProducts[-2] # remove ProductType
weightsLinearCorr <- linear.correlation(Volume~.
, oldProducts)
subsetLinearCorr <- cutoff.k(weightsLinearCorr, 5)
resultsLinearCorr <- as.simple.formula(subsetLinearCorr, "Volume")
print(resultsLinearCorr)
# Best feature subset ...
# StarReviews5 + StarReviews4 + StarReviews3 + PosServiceReview +
# StarReviews2
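If a monotonic but nonlinear relationship is suspected, FSelector also offers rank.correlation, which uses Spearman’s rank correlation. A minimal sketch, reusing oldProducts from above with ProductType already removed:
# Spearman (rank) correlation: captures monotonic, nonlinear relationships
weightsRankCorr <- rank.correlation(Volume~.
                   , oldProducts)
subsetRankCorr <- cutoff.k(weightsRankCorr, 5)
print(as.simple.formula(subsetRankCorr, "Volume"))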
References
[1] Jovic, Brkic and Bogunovic. A review of feature selection methods with applications. MIPRO. 2015.
[2] Eichenlaub. What Is An Intuitive Way To Understand Entropy?. Forbes. 2016.
[3] mfschulte222. Introduction to Entropy for Data Science. MS in Data Analytics. CUNY. 2014.
Other Readings
Witten, Frank, Hall and Pal. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, 4th Ed. 2017.
Kojadinovic and Wattka. Comparison between a filter and a wrapper approach to variable subset selection in regression problems. ResearchGate. 1999.
Yahya, Osman, Ramli and Balola. Feature Selection for High Dimensional Data: An Evolutionary Filter Approach. Journal of Computer Science. 2001.