Reputation: 75
I am building a decision tree classifier and I found this method for calculating information gain. This may be a silly question, but I am wondering if the split in this method is for numeral or categorical attributes? I'm confused because I thought a threshold (median) was used for numeral splits, but this method uses String values.
Any help is appreciated.
Here is the code:
public static double getInfoGain(int f, ArrayList<String[]> dataSubset) {
double entropyBefore = getEntropy(dataSubset); //Entropy before split
if(entropyBefore != 0){ // Calculate information gain if entropy is not 0
String threshold = thresholdMap.get(f); // Get threshold value of the feature
ArrayList<String[]> leftData = new ArrayList<String[]>();
ArrayList<String[]> rightData = new ArrayList<String[]>();
for(String[] d : dataSubset) {
if(d[f].equals(threshold)) {
leftData.add(d); // If feature value of data == threshold, add it to leftData
} else {
rightData.add(d); // If feature value of data != threshold, add it to leftData
}
}
if(leftData.size() > 0 && rightData.size() > 0) {
double leftProb = (double)leftData.size()/dataSubset.size();
double rightProb = (double)rightData.size()/dataSubset.size();
double entropyLeft = getEntropy(leftData); //Entropy after split - left
double entropyRight = getEntropy(rightData); //Entropy after split - right
double gain = entropyBefore - (leftProb * entropyLeft) - (rightProb * entropyRight);
return gain;
} else { // If entropy = 0 on either subsets of data, return 0
return 0;
}
} else { // If entropy = 0 before split, return 1
return -1;
}
}
Upvotes: 0
Views: 57
Reputation: 987
Although the code you pointed at uses the terminology of thresholds, if you look at the comments, it is using them in a categorical or binary way.
if(d[f].equals(threshold)) {
leftData.add(d); // If feature value of data == threshold, add it to leftData
} else {
rightData.add(d); // If feature value of data != threshold, add it to leftData
}
I would strongly recommend looking at the algorithms from a textbook or Wikipedia as a reference instead of going directly to code. Or, if you find yourself needing code examples, I would look for repositories on Github that are of higher quality (three dimensions).
Ideally, you want many indications of trust. If I go to github, search for decision tree, check Java, sort by most stars, I'd look at one of sanity/quickml or saebyn/java-decision-tree myself.
Upvotes: 0