user3434580

Reputation: 93

What does FSelector information.gain measure?

I am trying to understand how to properly use the R package FSelector, and in particular, its information.gain function. According to the documentation:

information gain = H(class) + H(attribute) - H(class,attribute)

What do these quantities mean? And how do they relate to the standard definition of Information Gain? As far as I know, Information Gain due to an attribute = H(S) - sum p(S_i) H(S_i), where H(.) is entropy; S is the unpartitioned set; S_i are the subsets of S induced by the attribute; and p(S_i) = |S_i|/|S|.

I would also like to know if there are any other packages that use the concept of Information Gain.

Thank you for your help.

Upvotes: 3

Views: 14376

Answers (1)

andresram1

Reputation: 131

The idea behind FSelector and its functions is to choose the best combination of attributes in a data set. Depending on the data set you are dealing with, some attributes may be unnecessary.

information.gain is a function that scores each attribute according to its "Information Gain" with respect to the class, so you can then pick the best combination. This function is based on entropy (you can read a lot of docs about that).
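Regarding the question about how the documented formula relates to the textbook one: they are equivalent. By the chain rule, H(class, attribute) = H(attribute) + H(class | attribute), so H(class) + H(attribute) - H(class, attribute) = H(class) - H(class | attribute), and H(class | attribute) is exactly sum p(S_i) H(S_i). A small base-R sketch checking this numerically on a toy data set (the `entropy` and `joint_entropy` helpers are mine, not part of FSelector):

```r
# Helper: Shannon entropy (base 2) of a discrete vector
entropy <- function(x) {
  p <- table(x) / length(x)
  -sum(p * log2(p))
}

# Helper: joint entropy of two discrete vectors
joint_entropy <- function(x, y) {
  p <- table(x, y) / length(x)
  p <- p[p > 0]          # drop empty cells; 0 * log(0) is taken as 0
  -sum(p * log2(p))
}

cls <- c("yes", "yes", "no", "no", "yes", "no")
att <- c("a",   "a",   "b",  "b",  "b",   "a")

# FSelector's documented form: H(class) + H(attribute) - H(class, attribute)
ig1 <- entropy(cls) + entropy(att) - joint_entropy(cls, att)

# Textbook form: H(S) - sum p(S_i) * H(S_i), partitioning by attribute value
ig2 <- entropy(cls) -
  sum(sapply(split(cls, att),
             function(s) length(s) / length(cls) * entropy(s)))

all.equal(ig1, ig2)  # TRUE: the two definitions coincide
```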

Here is an example using the famous IRIS dataset (See the full example at http://rgm3.lab.nig.ac.jp/RGM/R_rdfile?f=FSelector/man/information.gain.Rd&d=R_CC):

library(FSelector)
data(iris)

# Score each attribute by its information gain with respect to Species
weights <- information.gain(Species~., iris)
print(weights)

# Keep the two highest-scoring attributes
subset <- cutoff.k(weights, 2)

# Build a formula from the selected attributes
f <- as.simple.formula(subset, "Species")
print(f)

That means that the most important attributes are Petal.Width and Petal.Length.

There are several other packages with similar functions (RWeka, CORElearn, FSelector...).
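To illustrate, a sketch of equivalent information-gain rankings in two of those packages (this assumes RWeka, which needs Java, and CORElearn are installed):

```r
# RWeka: wrapper around Weka's InfoGainAttributeEval
library(RWeka)
InfoGainAttributeEval(Species ~ ., data = iris)

# CORElearn: attrEval with the "InfGain" estimator
library(CORElearn)
ig <- attrEval(Species ~ ., iris, estimator = "InfGain")
sort(ig, decreasing = TRUE)  # petal attributes should rank first, as above
```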

Upvotes: 9
