Numeral or Categorical split?

Question

I am building a decision tree classifier and I found this method for calculating information gain. This may be a silly question, but I am wondering if the split in this method is for numeral or categorical attributes? I'm confused because I thought a threshold (median) was used for numeral splits, but this method uses String values.

Any help is appreciated.

Here is the code:

    public static double getInfoGain(int f, ArrayList dataSubset) {
            double entropyBefore = getEntropy(dataSubset); //Entropy before split
            if(entropyBefore != 0){ // Calculate information gain if entropy is not 0
                String threshold = thresholdMap.get(f); // Get threshold value of the feature
                ArrayList leftData = new ArrayList();
                ArrayList rightData = new ArrayList();
                for(String[] d : dataSubset) {
                    if(d[f].equals(threshold)) {
                        leftData.add(d); // If feature value of data == threshold, add it to leftData
                    } else {
                        rightData.add(d); // If feature value of data != threshold, add it to leftData
                    }
                }
                if(leftData.size() > 0 && rightData.size() > 0) {
                    double leftProb = (double)leftData.size()/dataSubset.size(); 
                    double rightProb = (double)rightData.size()/dataSubset.size();
                    double entropyLeft = getEntropy(leftData); //Entropy after split - left
                    double entropyRight = getEntropy(rightData); //Entropy after split - right
                    double gain = entropyBefore - (leftProb * entropyLeft) - (rightProb * entropyRight);
                    return gain;
                } else { // If entropy = 0 on either subsets of data, return 0
                    return 0;
                }
            } else { // If entropy = 0 before split, return 1
                return -1;
            }
        }

John Foley · Accepted Answer

Although the code you pointed at uses the terminology of thresholds, if you look at the comments, it is using them in a categorical or binary way.

if(d[f].equals(threshold)) {
   leftData.add(d); // If feature value of data == threshold, add it to leftData
} else {
   rightData.add(d); // If feature value of data != threshold, add it to leftData
}

I would strongly recommend looking at the algorithms from a textbook or Wikipedia as a reference instead of going directly to code. Or, if you find yourself needing code examples, I would look for repositories on Github that are of higher quality (three dimensions).

You want to study code with a clear license. In many places, not having a license is equivalent to being proprietary, despite Github's implied Open-Source nature, that's just not legally accurate.
You want to study code that people use. There are many more decision tree algorithm implementations on github that have more than zero stars and issues.
Failing that, you want to study code that has tests (an indication and a chance to test if it actually works for yourself).

Ideally, you want many indications of trust. If I go to github, search for decision tree, check Java, sort by most stars, I'd look at one of sanity/quickml or saebyn/java-decision-tree myself.

Numeral or Categorical split?

Answers (1)

Related Questions