Reputation: 9
ERROR Executor: Exception in task 0.0 in stage 10.0 (TID 20)
scala.MatchError: [0.0,(20,[0,5,9,17],[0.6931471805599453,0.6931471805599453,0.28768207245178085,1.3862943611198906])] (of class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema)
I am seeing this error in my Scala program, where I am trying to classify movie reviews using a NaiveBayes classifier. The error occurs on the line where I attempt to train the classifier. I am unable to fix it because I don't know what datatype the classifier expects. The documentation for NaiveBayes says it expects an RDD, which is what I have. Any help would be greatly appreciated. My full Scala code for this movie review classification program is below.
PS: Please ignore any indentation mistakes in the code; it's correct in my program file. Thanks in advance.
import org.apache.spark.sql.{Dataset, DataFrame, SparkSession}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.functions._
import org.apache.spark.sql._
import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer, PCA}
import org.apache.spark.mllib.classification.{NaiveBayes,NaiveBayesModel}
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg._
//Reading the file from csv into dataframe object
val sqlContext = new SQLContext(sc)
val df = sqlContext.read.option("header", "true").option("delimiter",",").option("inferSchema", "true").csv("movie-pang02.csv")
//Tokenizing the data by splitting the text into words
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val wordsData = tokenizer.transform(df)
//Hashing the data by converting the words into rawFeatures
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(200)
val featurizedData = hashingTF.transform(wordsData)
//Applying Estimator on the data which converts the raw features into features by scaling each column
val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val idfModel = idf.fit(featurizedData)
val rescaledData = idfModel.transform(featurizedData)
val coder: (String => Int) = (arg: String) => {if (arg == "Pos") 1 else 0}
val sqlfunc = udf(coder)
val new_set = rescaledData.withColumn("label", sqlfunc(col("class")))
val EntireDataRdd = new_set.select("label","features").map{case Row(label: Int, features: Vector) => LabeledPoint(label.toDouble, Vectors.dense(features.toArray))}
//Converted the data into RDD<LabeledPoint> format so as to input it into the inbuilt Naive Bayes classifier
val labeled = EntireDataRdd.rdd
val Array(trainingData, testData) = labeled.randomSplit(Array(0.7, 0.3), seed = 1234L)
//Error in the following statement
val model = NaiveBayes.train(trainingData, lambda = 1.0, modelType = "multinomial")
val predictionAndLabel = testData.map(p => (model.predict(p.features), p.label))
val accuracy = 1.0 * predictionAndLabel.filter(x => x._1 == x._2).count() / testData.count()
val testErr = predictionAndLabel.filter(r => r._1 != r._2).count.toDouble / testData.count()
Upvotes: 1
Views: 1142
Reputation: 37822
This is a painful (and not uncommon) pitfall: you're matching the contents of your Row against the wrong Vector class. It should be org.apache.spark.ml.linalg.Vector, not org.apache.spark.mllib.linalg.Vector (yes, frustrating!). The ml transformers you used (HashingTF, IDF) put org.apache.spark.ml.linalg.Vector values into the features column, so the pattern match against the mllib type fails at runtime with exactly the MatchError you're seeing.
Adding the right imports before the mapping solves this issue:
import org.apache.spark.ml.linalg.Vector // and not org.apache.spark.mllib.linalg.Vector!
import org.apache.spark.mllib.linalg.Vectors // and not org.apache.spark.ml.linalg.Vectors!
val EntireDataRdd = new_set.select("label", "features").map {
  case Row(label: Int, features: Vector) => LabeledPoint(label.toDouble, Vectors.dense(features.toArray))
}
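As a side note: Vectors.dense(features.toArray) converts the (typically sparse) TF-IDF vectors into dense ones. If you'd rather keep them sparse, the mllib side provides a converter, Vectors.fromML, which translates an ml vector directly. Here is a minimal sketch of the same mapping rewritten with it (reusing the column names from your code):
import org.apache.spark.sql.Row
import org.apache.spark.ml.linalg.Vector       // the ml vector type produced by HashingTF/IDF
import org.apache.spark.mllib.linalg.Vectors   // the mllib companion object with the fromML converter
import org.apache.spark.mllib.regression.LabeledPoint
// Same mapping as above, but Vectors.fromML keeps the sparse representation
// instead of materializing a dense array for every row.
val EntireDataRdd = new_set.select("label", "features").map {
  case Row(label: Int, features: Vector) =>
    LabeledPoint(label.toDouble, Vectors.fromML(features))
}
Alternatively, since your featurization already uses the ml pipeline API, you could skip the mllib conversion entirely and train with org.apache.spark.ml.classification.NaiveBayes, which fits directly on the label/features DataFrame.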
Upvotes: 1