Jin Park

Reputation: 461

Decision Tree Classification with Spark MLlib

I was reading through the decision tree classification part of the site below: http://spark.apache.org/docs/latest/mllib-decision-tree.html

I ran the provided example code on my laptop and tried to understand its output, but there is a part I couldn't follow. Below is the code; sample_libsvm_data.txt can be found at https://github.com/apache/spark/blob/master/data/mllib/sample_libsvm_data.txt

Please refer to the output and let me know whether my understanding is correct. Here are my thoughts:

  1. Does the Test Error mean the model is approximately 95% correct, based on the training data?
  2. (the one I am most curious about) If feature 434 is greater than 0.0, is the prediction 1 based on the Gini impurity? For example, if a value is given as 434:178, would the prediction be 1? (See the single-point probe after the output below.)

    from __future__ import print_function
    from pyspark import SparkContext
    from pyspark.mllib.tree import DecisionTree, DecisionTreeModel
    from pyspark.mllib.util import MLUtils
    
    if __name__ == "__main__":
        sc = SparkContext(appName="PythonDecisionTreeClassificationExample")
        # Load the data in LIBSVM format as an RDD of LabeledPoint.
        data = MLUtils.loadLibSVMFile(sc, '/home/spark/bin/sample_libsvm_data.txt')
        # Hold out 30% of the data for testing.
        (trainingData, testData) = data.randomSplit([0.7, 0.3])
    
        # Empty categoricalFeaturesInfo means all features are continuous.
        model = DecisionTree.trainClassifier(trainingData, numClasses=2,
                                             categoricalFeaturesInfo={},
                                             impurity='gini', maxDepth=5,
                                             maxBins=32)
    
        # Evaluate the model on the held-out test data.
        predictions = model.predict(testData.map(lambda x: x.features))
        labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)
        # Note: tuple-unpacking lambdas like `lambda (v, p): ...` are
        # Python 2 only; indexing the pair works on Python 3 as well.
        testErr = (labelsAndPredictions.filter(lambda vp: vp[0] != vp[1]).count()
                   / float(testData.count()))
    
        print('Test Error = ' + str(testErr))
        print('Learned classification tree model:')
        print(model.toDebugString())
    
    # ===== Below is my output =====
    Test Error = 0.0454545454545
    Learned classification tree model:
    DecisionTreeModel classifier of depth 1 with 3 nodes
    If (feature 434 <= 0.0)
      Predict: 0.0
    Else (feature 434 > 0.0)
      Predict: 1.0
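
To make question 2 concrete, this is the kind of single-point check I have in mind (an untested sketch: 692 stands in for whatever feature count loadLibSVMFile infers for this file, and 178.0 is just a value copied from a 434:178 entry in the data):

    from pyspark.mllib.linalg import Vectors
    
    # Sparse vector with only feature 434 set; all other features are 0.
    # 692 here is assumed to be the feature count inferred from the file.
    point = Vectors.sparse(692, {434: 178.0})
    print(model.predict(point))  # expect 1.0, since 178.0 > 0.0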
    

Upvotes: 0

Views: 1253

Answers (2)

Jimmy

Reputation: 1

Why, in Spark ML, when training a decision tree model, are minInfoGain and the minimum number of instances per node not used to control the growth of the tree? It is very easy to overgrow the tree.
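
For reference, in the question's example these controls are simply left at their defaults: pyspark.mllib.tree's trainClassifier accepts minInstancesPerNode and minInfoGain keyword arguments (defaulting to 1 and 0.0, which impose no constraint). A sketch of the same call with explicit growth limits (the values 10 and 0.01 are arbitrary examples):

    # Same training call as in the question, plus explicit growth limits.
    model = DecisionTree.trainClassifier(trainingData, numClasses=2,
                                         categoricalFeaturesInfo={},
                                         impurity='gini', maxDepth=5,
                                         maxBins=32,
                                         minInstancesPerNode=10,  # arbitrary example value
                                         minInfoGain=0.01)        # arbitrary example value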

Upvotes: 0

Katya Willard

Reputation: 2182

I believe you are correct. Yes, your error rate is about 5%, so your algorithm is correct about 95% of the time, for the 30% of the data you withheld as testing (not the training data). According to your output (which I will assume is correct; I did not test the code myself), yes, the only feature that determines the class of an observation is feature 434: if it is less than or equal to 0.0 the prediction is 0.0, otherwise 1.0.
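
As a quick sanity check on the 95% reading, accuracy can be computed directly from the labelsAndPredictions RDD already built in your code (a sketch reusing only your existing variables):

    # Fraction of test points where the label matches the prediction;
    # this should equal 1 - testErr.
    correct = labelsAndPredictions.filter(lambda vp: vp[0] == vp[1]).count()
    accuracy = correct / float(testData.count())
    print('Test Accuracy = ' + str(accuracy))  # ~0.95 for the output shown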

Upvotes: 2
