Reputation: 461
I was reading the decision tree classification section of http://spark.apache.org/docs/latest/mllib-decision-tree.html
I built the provided example code on my laptop and tried to understand its output, but there is one part I couldn't understand. Below is the code; sample_libsvm_data.txt can be found at https://github.com/apache/spark/blob/master/data/mllib/sample_libsvm_data.txt
Please refer to the output and let me know whether my opinion is correct. Here is my opinion:
(the part I'm most curious about) If feature 434 is greater than 0.0, then the prediction would be 1.0 based on Gini impurity? For example, if the value is given as 434:178, then the prediction would be 1.0.
from __future__ import print_function

from pyspark import SparkContext
from pyspark.mllib.tree import DecisionTree, DecisionTreeModel
from pyspark.mllib.util import MLUtils

if __name__ == "__main__":
    sc = SparkContext(appName="PythonDecisionTreeClassificationExample")
    data = MLUtils.loadLibSVMFile(sc, '/home/spark/bin/sample_libsvm_data.txt')
    (trainingData, testData) = data.randomSplit([0.7, 0.3])
    model = DecisionTree.trainClassifier(trainingData, numClasses=2,
                                         categoricalFeaturesInfo={},
                                         impurity='gini', maxDepth=5, maxBins=32)
    predictions = model.predict(testData.map(lambda x: x.features))
    labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)
    # tuple unpacking in lambdas was removed in Python 3; index the pair instead
    testErr = labelsAndPredictions.filter(lambda vp: vp[0] != vp[1]).count() / float(testData.count())
    print('Test Error = ' + str(testErr))
    print('Learned classification tree model:')
    print(model.toDebugString())
===== Below is my output =====
Test Error = 0.0454545454545
Learned classification tree model:
DecisionTreeModel classifier of depth 1 with 3 nodes
  If (feature 434 <= 0.0)
   Predict: 0.0
  Else (feature 434 > 0.0)
   Predict: 1.0
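To restate my reading of the printed tree as plain Python (treating each row's sparse libsvm features as a dict, with absent indices defaulting to 0.0 — my assumption):

```python
def predict(features):
    """Mimic the learned one-split tree: class 1.0 iff feature 434 > 0.0."""
    return 1.0 if features.get(434, 0.0) > 0.0 else 0.0

# a sparse libsvm row such as "1 434:178" becomes {434: 178.0}
print(predict({434: 178.0}))  # 1.0
print(predict({}))            # 0.0 (feature 434 absent, so treated as 0.0)
```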
Upvotes: 0
Views: 1253
Reputation: 1
Why, in Spark ML, when training a decision tree model, are minInfoGain and the minimum number of instances per node not used to control the growth of the tree? It is very easy to overgrow the tree.
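For concreteness, the pre-pruning rule I have in mind works like this (a toy sketch with made-up class counts and a hypothetical threshold, not Spark's implementation):

```python
def gini(counts):
    """Gini impurity of a node given its per-class counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)

def info_gain(parent, left, right):
    """Impurity reduction achieved by splitting parent into left/right."""
    n = sum(parent)
    weighted = (sum(left) / n) * gini(left) + (sum(right) / n) * gini(right)
    return gini(parent) - weighted

MIN_INFO_GAIN = 0.01  # hypothetical threshold

# parent node: 50 of class 0, 50 of class 1; this candidate split
# barely separates the classes, so its gain is tiny
gain = info_gain([50, 50], [26, 24], [24, 26])
print(gain >= MIN_INFO_GAIN)  # False: the split would be rejected
```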
Upvotes: 0
Reputation: 2182
I believe you are correct. Your error rate is about 5%, so the model is correct about 95% of the time on the 30% of the data you withheld for testing. According to your output (which I will assume is correct; I did not run the code myself), the only feature that determines the class of an observation is feature 434: if it is less than or equal to 0.0 the prediction is 0.0, otherwise it is 1.0.
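To make the Gini criterion concrete (a toy sketch; the 40/60 class counts are made up, not taken from the actual dataset): a split is good when it produces child nodes that are purer than the parent, and a node where all labels agree has impurity 0, so the tree simply predicts that label there.

```python
def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

# hypothetical node: 40 rows of class 0.0 and 60 of class 1.0,
# perfectly separated by the feature-434 threshold
node = [0.0] * 40 + [1.0] * 60
left, right = node[:40], node[40:]  # feature 434 <= 0.0 vs > 0.0

print(gini(node))   # about 0.48 before the split
print(gini(left))   # 0.0: pure node, so predict 0.0
print(gini(right))  # 0.0: pure node, so predict 1.0
```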
Upvotes: 2