Reputation: 772
I am experimenting with the mlr package and would like to get chi-squared and information-gain values.
library(mlr)
library(FSelector)
library(mlbench)  # PimaIndiansDiabetes is shipped with the mlbench package
data(PimaIndiansDiabetes)
# 60/40 train split
indi <- sample(1:nrow(PimaIndiansDiabetes), 0.6 * nrow(PimaIndiansDiabetes))
train <- PimaIndiansDiabetes[indi, ]
trainTask <- makeClassifTask(data = train, target = "diabetes", positive = "pos")
# Feature importance
im_feat <- generateFilterValuesData(trainTask, method = c("information.gain", "chi.squared"))
plotFilterValues(im_feat)
im_feat
I am not sure what it means that both information.gain and chi.squared are 0 for the variables triceps and pressure. Does that indicate I should not use them when setting up a model (e.g. a random forest)?
When I use
tbl <- table(train$triceps, train$diabetes)
chisq.test(tbl)
it gives me 60.473 for chi-squared. Why is it not 0? What is the difference between chisq.test() and the chi.squared method from mlr?
Upvotes: 0
Views: 299
Reputation: 109242
Regarding your first question, values of 0 generally indicate that, according to the particular filter method you applied, the feature is not predictive with respect to the target variable. That does not necessarily mean the same is true for a particular type of model, so it usually doesn't make sense to remove such features. Apart from that, many models perform feature selection internally (random forests being one of them), so this kind of preprocessing doesn't make much sense in general, unless you have so many features that, for example, building a random forest takes too long.
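As a hedged sketch of what I mean (using the same old-style mlr API as in your question; keeping 5 features and 5-fold CV are just illustrative choices), you can compare a random forest trained on the full task with one trained on a task filtered by information gain:
library(mlr)
library(mlbench)
data(PimaIndiansDiabetes)
task <- makeClassifTask(data = PimaIndiansDiabetes, target = "diabetes", positive = "pos")
# keep only the 5 highest-scoring features according to the information.gain filter
filtered <- filterFeatures(task, method = "information.gain", abs = 5)
lrn <- makeLearner("classif.randomForest")
rdesc <- makeResampleDesc("CV", iters = 5)
set.seed(1)
res_full <- resample(lrn, task, rdesc, measures = acc)
set.seed(1)
res_filt <- resample(lrn, filtered, rdesc, measures = acc)
res_full$aggr  # accuracy with all features
res_filt$aggr  # accuracy after dropping the low-scoring features
If the two accuracies are close, the filtering step isn't buying you anything for a random forest.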
The chi.squared filter in mlr and chisq.test() are based on different implementations; I'm not sure why they don't return the same result.
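If it helps, here is a hedged sketch for comparing the two directly: mlr's chi.squared filter delegates to FSelector::chi.squared, which (as far as I can tell) discretizes numeric predictors and reports a rescaled importance score, while chisq.test() gives you the raw Pearson statistic on the table you built yourself.
library(FSelector)
# importance scores as used by mlr's chi.squared filter (same scale as im_feat)
FSelector::chi.squared(diabetes ~ ., data = train)
# raw Pearson chi-squared statistic on the untransformed triceps values
chisq.test(table(train$triceps, train$diabetes))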
Upvotes: 0