Reputation: 55
I am a new user of Spark on Scala. Here is my code, but I cannot figure out how to calculate the prediction and accuracy. Do I have to transform the CSV file into LibSVM format, or can I just load the CSV file?
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorAssembler}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

object Test2 {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder
      .appName("WineQualityDecisionTreeRegressorPMML")
      .master("local")
      .getOrCreate()

    // Load and parse the CSV data file.
    val df = spark.read
      .format("csv")
      .option("header", "true")
      .option("mode", "DROPMALFORMED")
      .option("delimiter", ",")
      .load("file:///c:/tmp/spark-warehouse/winequality_red_names.csv")

    val inputFields = List("fixed acidity", "volatile acidity", "citric acid", "residual sugar", "chlorides",
      "free sulfur dioxide", "total sulfur dioxide", "density", "pH", "sulphates", "alcohol")

    // Cast the string columns read from the CSV to Double.
    val toDouble = udf[Double, String](_.toDouble)
    val dff = df
      .withColumn("fixed acidity", toDouble(df("fixed acidity")))         // 0 +
      .withColumn("volatile acidity", toDouble(df("volatile acidity")))   // 1 +
      .withColumn("citric acid", toDouble(df("citric acid")))             // 2 -
      .withColumn("residual sugar", toDouble(df("residual sugar")))       // 3 +
      .withColumn("chlorides", toDouble(df("chlorides")))                 // 4 -
      .withColumn("free sulfur dioxide", toDouble(df("free sulfur dioxide")))   // 5 +
      .withColumn("total sulfur dioxide", toDouble(df("total sulfur dioxide"))) // 6 +
      .withColumn("density", toDouble(df("density")))                     // 7 -
      .withColumn("pH", toDouble(df("pH")))                               // 8 +
      .withColumn("sulphates", toDouble(df("sulphates")))                 // 9 +
      .withColumn("alcohol", toDouble(df("alcohol")))                     // 10 +

    // Assemble the feature columns into a single vector column.
    val assembler = new VectorAssembler()
      .setInputCols(inputFields.toArray)
      .setOutputCol("features")

    // Fit on the whole dataset to include all labels in the index.
    val labelIndexer = new StringIndexer()
      .setInputCol("quality")
      .setOutputCol("indexedLabel")
      .fit(dff)

    // Specify layers for the neural network:
    // input layer of size 11 (features), two intermediate of size 10 and 20,
    // and output of size 6 (classes).
    val layers = Array[Int](11, 10, 20, 6)

    // Train a multilayer perceptron classifier.
    val dt = new MultilayerPerceptronClassifier()
      .setLayers(layers)
      .setBlockSize(128)
      .setSeed(1234L)
      .setMaxIter(100)
      .setLabelCol("indexedLabel")
      .setFeaturesCol("features")

    // Convert indexed labels back to original labels.
    val labelConverter = new IndexToString()
      .setInputCol("prediction")
      .setOutputCol("predictedLabel")
      .setLabels(labelIndexer.labels)

    // Create the pipeline.
    val pipeline = new Pipeline()
      .setStages(Array(assembler, labelIndexer, dt, labelConverter))

    // Train the model.
    val model = pipeline.fit(dff)
  }
}
Any ideas, please? I can't find any example of a neural network trained on a CSV file using a pipeline.
Upvotes: 0
Views: 232
Reputation: 26046
Once you have your model trained (val model = pipeline.fit(dff)), you need to predict the label for every test sample using the model.transform method. For each prediction you then check whether it matches the actual label; the accuracy is the ratio of correctly classified samples to the size of the test set.
If you want to use the same DataFrame that was used for training, then simply call val predictions = model.transform(dff), then iterate over predictions and check whether they match the corresponding labels. However, I do not recommend reusing the same DataFrame - it's better to split it into training and test subsets.
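As a sketch (assuming the same dff and pipeline values as in your question), the split-and-evaluate step could look like this, using Spark ML's built-in MulticlassClassificationEvaluator instead of counting matches by hand:

```scala
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

// Split the data into training and test subsets (70% / 30%).
val Array(trainingData, testData) = dff.randomSplit(Array(0.7, 0.3), seed = 1234L)

// Train on the training subset only.
val model = pipeline.fit(trainingData)

// Predict labels for the held-out test subset.
val predictions = model.transform(testData)

// Compare the indexed prediction column against the indexed label column.
val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("indexedLabel")
  .setPredictionCol("prediction")
  .setMetricName("accuracy")

val accuracy = evaluator.evaluate(predictions)
println(s"Test set accuracy = $accuracy")
```

The evaluator computes exactly the ratio described above (correctly classified test samples over test set size), so you don't have to iterate over the rows yourself.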
Upvotes: 1