Reputation: 461
I was reading the decision tree classification section of http://spark.apache.org/docs/latest/mllib-decision-tree.html
I built the provided example code on my laptop and tried to understand its output, but there is one part I couldn't understand. Below is the code; sample_libsvm_data.txt can be found at https://github.com/apache/spark/blob/master/data/mllib/sample_libsvm_data.txt
Please refer to the output and let me know whether my opinion is correct. Here is my opinion:
(the part I'm most curious about) If feature 434 is greater than 0.0, then the prediction would be 1.0 based on Gini impurity? For example, if the value is given as 434:178, then the prediction would be 1.0.
from __future__ import print_function

from pyspark import SparkContext
from pyspark.mllib.tree import DecisionTree, DecisionTreeModel
from pyspark.mllib.util import MLUtils

if __name__ == "__main__":
    sc = SparkContext(appName="PythonDecisionTreeClassificationExample")
    data = MLUtils.loadLibSVMFile(sc, '/home/spark/bin/sample_libsvm_data.txt')
    (trainingData, testData) = data.randomSplit([0.7, 0.3])
    model = DecisionTree.trainClassifier(trainingData, numClasses=2,
                                         categoricalFeaturesInfo={},
                                         impurity='gini', maxDepth=5, maxBins=32)
    predictions = model.predict(testData.map(lambda x: x.features))
    labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)
    # tuple unpacking in lambdas was removed in Python 3; index the pair instead
    testErr = labelsAndPredictions.filter(lambda vp: vp[0] != vp[1]).count() / float(testData.count())
    print('Test Error = ' + str(testErr))
    print('Learned classification tree model:')
    print(model.toDebugString())
===== Below is my output =====
Test Error = 0.0454545454545
Learned classification tree model:
DecisionTreeModel classifier of depth 1 with 3 nodes
  If (feature 434 <= 0.0)
   Predict: 0.0
  Else (feature 434 > 0.0)
   Predict: 1.0
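To restate my reading of the printed tree as plain Python (treating each row's sparse libsvm features as a dict, with absent indices defaulting to 0.0 — my assumption):

```python
def predict(features):
    """Mimic the learned one-split tree: class 1.0 iff feature 434 > 0.0."""
    return 1.0 if features.get(434, 0.0) > 0.0 else 0.0

# a sparse libsvm row such as "1 434:178" becomes {434: 178.0}
print(predict({434: 178.0}))  # 1.0
print(predict({}))            # 0.0 (feature 434 absent, so treated as 0.0)
```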
Upvotes: 0
Views: 1253
Reputation: 1
Why, in Spark ML, when training a decision tree model, are minInfoGain and the minimum number of instances per node not used to control the growth of the tree? It is very easy to overgrow the tree.
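For concreteness, the pre-pruning rule I have in mind works like this (a toy sketch with made-up class counts and a hypothetical threshold, not Spark's implementation):

```python
def gini(counts):
    """Gini impurity of a node given its per-class counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)

def info_gain(parent, left, right):
    """Impurity reduction achieved by splitting parent into left/right."""
    n = sum(parent)
    weighted = (sum(left) / n) * gini(left) + (sum(right) / n) * gini(right)
    return gini(parent) - weighted

MIN_INFO_GAIN = 0.01  # hypothetical threshold

# parent node: 50 of class 0, 50 of class 1; this candidate split
# barely separates the classes, so its gain is tiny
gain = info_gain([50, 50], [26, 24], [24, 26])
print(gain >= MIN_INFO_GAIN)  # False: the split would be rejected
```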
Upvotes: 0
Reputation: 2182
I believe you are correct. Your error rate is about 5%, so the model is correct about 95% of the time on the 30% of the data you withheld for testing. According to your output (which I will assume is correct; I did not run the code myself), the only feature that determines the class of an observation is feature 434: if it is less than or equal to 0.0 the prediction is 0.0, otherwise it is 1.0.
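To make the Gini criterion concrete (a toy sketch; the 40/60 class counts are made up, not taken from the actual dataset): a split is good when it produces child nodes that are purer than the parent, and a node where all labels agree has impurity 0, so the tree simply predicts that label there.

```python
def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

# hypothetical node: 40 rows of class 0.0 and 60 of class 1.0,
# perfectly separated by the feature-434 threshold
node = [0.0] * 40 + [1.0] * 60
left, right = node[:40], node[40:]  # feature 434 <= 0.0 vs > 0.0

print(gini(node))   # about 0.48 before the split
print(gini(left))   # 0.0: pure node, so predict 0.0
print(gini(right))  # 0.0: pure node, so predict 1.0
```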
Upvotes: 2