Reputation: 93
I am trying to understand how to properly use the R package FSelector, and in particular, its information.gain function. According to the documentation:
information gain = H(class) + H(attribute) - H(class,attribute)
What do these quantities mean? And how do they relate to the standard definition of information gain? As far as I know, the information gain due to an attribute is
information gain = H(S) - sum_i p(S_i) * H(S_i)
where H(.) is entropy, S is the unpartitioned set, S_i are the subsets of S induced by the attribute, and p(S_i) = |S_i| / |S|.
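To check my understanding, here is a small base-R sketch (no packages; the `entropy` helper and the toy vectors are my own) verifying that the two formulas give the same number:

```r
# Compare H(class) + H(attribute) - H(class, attribute)  [FSelector's formula]
# with    H(S) - sum_i p(S_i) * H(S_i)                   [the textbook formula]
# on a toy dataset.

entropy <- function(x) {
  p <- prop.table(table(x))
  -sum(p * log2(p))
}

# Toy data: a binary class and a three-valued attribute
cls  <- c("yes", "yes", "no", "no", "yes", "no")
feat <- c("a",   "a",   "b",  "b",  "c",   "c")

# FSelector's formula: the joint entropy is the entropy of the paired values
joint <- paste(cls, feat)
ig1 <- entropy(cls) + entropy(feat) - entropy(joint)

# Textbook formula: entropy of S minus the weighted entropies of the subsets S_i
p_i <- prop.table(table(feat))
h_i <- tapply(cls, feat, entropy)
ig2 <- entropy(cls) - sum(p_i * h_i[names(p_i)])

all.equal(ig1, ig2)  # TRUE: the two definitions coincide
```

Both expressions are the mutual information I(class; attribute), which is why they agree.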
I would also like to know if there are any other packages that use the concept of Information Gain.
Thank you for your help.
Upvotes: 3
Views: 14376
Reputation: 131
The idea behind FSelector and its functions is to choose the best subset of attributes in a data set; depending on the data set you are dealing with, some attributes may be unnecessary.
information.gain ranks the attributes by their information gain, a measure based on entropy (there is plenty of literature on the topic).
Here is an example using the famous IRIS dataset (See the full example at http://rgm3.lab.nig.ac.jp/RGM/R_rdfile?f=FSelector/man/information.gain.Rd&d=R_CC):
library(FSelector)
data(iris)

# Score each attribute by its information gain with respect to Species
weights <- information.gain(Species ~ ., iris)
print(weights)

# Keep the two highest-scoring attributes and build a formula from them
subset <- cutoff.k(weights, 2)
f <- as.simple.formula(subset, "Species")
print(f)
This means that the two most informative attributes are Petal.Width and Petal.Length.
Several other packages provide similar attribute-evaluation functions (RWeka, CORElearn, ...).
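For instance, here is a sketch of an equivalent ranking with CORElearn's attrEval() (assuming the package is installed; "InfGain" is one of its estimator names, but treat the exact call as an untested example rather than a drop-in replacement):

```r
# Hedged sketch: information-gain ranking of the iris attributes via CORElearn.
library(CORElearn)
data(iris)

# attrEval() evaluates each predictor against the class with the chosen estimator
weights <- attrEval(Species ~ ., iris, estimator = "InfGain")
print(weights)
```

As with FSelector, Petal.Length and Petal.Width should come out on top.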
Upvotes: 9