yPennylane

Reputation: 772

MLR package: generateFilterValuesData chi.squared and information.gain

I am experimenting with the mlr package and would like to get chi-squared and information-gain values.

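library(mlr)
library(FSelector)
library(mlbench)  # provides the PimaIndiansDiabetes data set

data(PimaIndiansDiabetes)
# Use a random 60% of the rows as the training set
indi <- sample(1:nrow(PimaIndiansDiabetes), 0.6 * nrow(PimaIndiansDiabetes))
train <- PimaIndiansDiabetes[indi,]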

trainTask <- makeClassifTask(data = train, target = "diabetes", positive = "pos")

#Feature importance
im_feat <- generateFilterValuesData(trainTask, method = c("information.gain","chi.squared"))
plotFilterValues(im_feat)
im_feat

I am not sure how to interpret the fact that both information.gain and chi.squared are zero for the variables triceps and pressure. Does that mean I should not use these variables when building a model (e.g. a random forest)?

When I use

tbl <- table(train$triceps, train$diabetes)
chisq.test(tbl)

it gives me 60.473 for chi-squared. Why is it not 0? What is the difference between chisq.test and the chi.squared method from mlr?

Upvotes: 0

Views: 299

Answers (1)

Lars Kotthoff

Reputation: 109242

Regarding your first question, a value of 0 generally indicates that the feature is not predictive with respect to the variable you're interested in, according to the particular evaluation method you applied. This does not necessarily mean that the same is true for a particular type of model, and hence it usually doesn't make sense to remove the feature. Apart from that, many models (random forests among them) perform feature selection internally, so this kind of preprocessing doesn't make sense in general, unless, for example, you have so many features that a random forest takes too long to build.
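To illustrate why a univariate score of 0 need not mean a feature is useless to a model, here is a minimal sketch on made-up data (not the Pima data set). The names `x1`, `x2`, `y`, and `joint` are hypothetical, and it uses base R's `chisq.test` rather than the mlr filter. Each feature on its own is independent of the class, yet the two together determine it exactly.

```r
# Hypothetical toy data: the class is the XOR of two balanced binary features.
n  <- 400
x1 <- rep(c(0, 1), each  = n / 2)
x2 <- rep(c(0, 1), times = n / 2)
y  <- factor(ifelse(xor(x1 == 1, x2 == 1), "pos", "neg"))

# Univariate view: each feature alone shows no association with y.
stat_x1 <- unname(chisq.test(table(x1, y), correct = FALSE)$statistic)
stat_x2 <- unname(chisq.test(table(x2, y), correct = FALSE)$statistic)

# Joint view: every (x1, x2) combination contains only one class.
joint <- table(interaction(x1, x2), y)
pure  <- all(apply(joint, 1, function(row) sum(row > 0) == 1))

print(c(x1 = stat_x1, x2 = stat_x2))
print(joint)
```

A model that can combine features, such as a tree ensemble, could in principle exploit this structure, while any filter that scores features one at a time would rank both at the bottom.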

The chi.squared filter in mlr and chisq.test are based on different implementations; I'm not sure why they don't return the same result.
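One possible contributor, offered as an assumption to verify rather than a confirmed explanation: as far as I understand the FSelector documentation, `chi.squared` discretizes numeric features before testing and reports Cramér's V instead of the raw test statistic, whereas `chisq.test(table(x, y))` treats every distinct numeric value as its own category. The sketch below uses simulated data and base R only (no mlr or FSelector calls); `x`, `y`, and the helper `cramers_v` are hypothetical names. It shows how a feature that is independent of the class can still produce a large raw statistic when it has many distinct values.

```r
set.seed(42)
n <- 300
y <- factor(sample(c("neg", "pos"), n, replace = TRUE))
x <- round(rnorm(n, mean = 25, sd = 10))  # simulated, independent of y

# Raw values: one table row per distinct value of x, so many sparse cells.
raw_tbl  <- table(x, y)
raw_test <- suppressWarnings(chisq.test(raw_tbl))

# Coarse discretization: split x at its median into two bins.
bins        <- cut(x, breaks = quantile(x, c(0, 0.5, 1)), include.lowest = TRUE)
binned_tbl  <- table(bins, y)
binned_test <- chisq.test(binned_tbl, correct = FALSE)

# Cramér's V rescales the statistic to the range [0, 1].
cramers_v <- function(test, tbl) {
  sqrt(unname(test$statistic) / (sum(tbl) * (min(dim(tbl)) - 1)))
}

print(c(raw_stat = unname(raw_test$statistic), raw_df = unname(raw_test$parameter)))
print(c(binned_stat = unname(binned_test$statistic), binned_df = unname(binned_test$parameter)))
print(c(raw_V = cramers_v(raw_test, raw_tbl), binned_V = cramers_v(binned_test, binned_tbl)))
```

Under independence the statistic is expected to be roughly as large as its degrees of freedom, so a value such as 60.473 on a table with dozens of rows does not by itself indicate an association; the p-value reported by `chisq.test` is the more meaningful quantity to compare.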

Upvotes: 0
