Reputation: 65
I am using the Random Forest algorithm for classification in Spark MLlib with PySpark. My code is as follows:
from pyspark.mllib.tree import RandomForest
model = RandomForest.trainClassifier(trnData, numClasses=3, categoricalFeaturesInfo={}, numTrees=3, featureSubsetStrategy="auto", impurity='gini', maxDepth=4, maxBins=32)
predictions = model.predict(tst_dataRDD.map(lambda x: x.features))
labelsAndPredictions = tst_dataRDD.map(lambda lp: lp.label).zip(predictions)
testErr = labelsAndPredictions.filter(lambda x: x[0] != x[1]).count() / float(tst_dataRDD.count())
I got IllegalArgumentException: GiniAggregator given label -0.0625but requires label to be non-negative.
How can I solve this problem? Thanks
Upvotes: 0
Views: 360
Reputation: 6323
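It seems that for Gini impurity in multiclass classification, the labels must be non-negative (>= 0). Please check whether any negative labels are present. For classification, MLlib expects labels to be class indices 0, 1, ..., numClasses - 1, so a fractional value such as -0.0625 suggests the label field may hold something other than a class index.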
ref - spark repo
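One way to check the labels is sketched below. It is a minimal, plain-Python illustration and is not taken from the Spark code base; the helper names `is_valid_class_label` and `invalid_labels` and the sample values are made up for this example.

```python
def is_valid_class_label(label, num_classes):
    """Return True if label is an integer-valued class index in [0, num_classes)."""
    value = float(label)
    return value.is_integer() and 0 <= value < num_classes


def invalid_labels(labels, num_classes):
    """Collect the distinct labels a classifier would reject, in sorted order."""
    return sorted({l for l in labels if not is_valid_class_label(l, num_classes)})


# Hypothetical sample labels, including the value from the error message.
sample_labels = [0.0, 1.0, 2.0, -0.0625, 3.0, 1.5]
print(invalid_labels(sample_labels, num_classes=3))
```

On the RDD from the question, the same predicate could probably be applied with something like `trnData.map(lambda lp: lp.label).filter(lambda l: not is_valid_class_label(l, 3)).distinct().take(10)` to list a few offending values.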
Also, as a side note, prefer the algorithms in the ml package over the legacy mllib package.
Upvotes: 1
Reputation: 733
Please use RandomForestClassifier instead and see the docs:
https://spark.apache.org/docs/latest/ml-classification-regression.html#random-forest-classifier
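A rough sketch of what the DataFrame-based version might look like is below. It assumes the data has already been converted to DataFrames with a "features" vector column and a "label" column holding class indices 0, 1, 2; the function names `train_random_forest` and `classification_error` are made up for this example, and the hyperparameters simply mirror the ones in the question.

```python
def train_random_forest(train_df, num_trees=3, max_depth=4, max_bins=32):
    """Fit a RandomForestClassifier on a DataFrame with 'features' and 'label' columns."""
    # Imported inside the function so the sketch can be loaded without Spark installed.
    from pyspark.ml.classification import RandomForestClassifier

    rf = RandomForestClassifier(
        featuresCol="features",
        labelCol="label",
        numTrees=num_trees,
        maxDepth=max_depth,
        maxBins=max_bins,
        impurity="gini",
        featureSubsetStrategy="auto",
    )
    return rf.fit(train_df)


def classification_error(model, test_df):
    """Return 1 - accuracy of the fitted model on test_df."""
    from pyspark.ml.evaluation import MulticlassClassificationEvaluator

    evaluator = MulticlassClassificationEvaluator(
        labelCol="label", predictionCol="prediction", metricName="accuracy"
    )
    return 1.0 - evaluator.evaluate(model.transform(test_df))
```

Usage would be along the lines of `model = train_random_forest(train_df)` followed by `classification_error(model, test_df)`. The labels still have to be valid class indices; switching packages does not remove that requirement.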
Upvotes: 0