blue-sky

Reputation: 53826

How to classify a new training example after model training in Apache Spark?

Reading the source of the example at https://spark.apache.org/docs/1.5.2/ml-ann.html:

import org.apache.spark.ml.classification.MultilayerPerceptronClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.sql.Row

// Load training data
val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_multiclass_classification_data.txt").toDF()
// Split the data into train and test
val splits = data.randomSplit(Array(0.6, 0.4), seed = 1234L)
val train = splits(0)
val test = splits(1)
// specify layers for the neural network: 
// input layer of size 4 (features), two intermediate of size 5 and 4 and output of size 3 (classes)
val layers = Array[Int](4, 5, 4, 3)
// create the trainer and set its parameters
val trainer = new MultilayerPerceptronClassifier()
  .setLayers(layers)
  .setBlockSize(128)
  .setSeed(1234L)
  .setMaxIter(100)
// train the model
val model = trainer.fit(train)
// compute precision on the test set
val result = model.transform(test)
val predictionAndLabels = result.select("prediction", "label")
val evaluator = new MulticlassClassificationEvaluator()
  .setMetricName("precision")
println("Precision:" + evaluator.evaluate(predictionAndLabels))

Once the model has been trained, how can a new training example be classified?

Can a new training example be added to the model with no label set, so that the classifier tries to classify it based on the training data?

Why is it required that the DataFrame labels are of type Double?

Upvotes: 2

Views: 1140

Answers (1)

Alberto Bonsanto

Reputation: 18022

First, the only way to add another observation to the model is to add that data point to the training data, in this case to your train variable. To do that, convert the point into a DataFrame (with a single record) and append it with the unionAll method. You will then have to retrain the model on this new dataset.
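As a rough sketch of that step, assuming the code from the question has already run in spark-shell (so sqlContext, train and trainer are in scope) and that you are on Spark 1.5.x. The feature values and the label below are made up for illustration:

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Hypothetical new labeled observation: label 1.0 and four feature values,
// matching the 4-feature layout of the sample dataset.
val newPoint = sqlContext.createDataFrame(Seq(
  LabeledPoint(1.0, Vectors.dense(-0.2, -0.1, 0.4, 0.0))
))

// unionAll matches columns by position, so select them in the same order
// as the training DataFrame ("label", then "features").
val extendedTrain = train.unionAll(newPoint.select("label", "features"))

// The existing model is not updated in place; fit a new one on the extended data.
val retrained = trainer.fit(extendedTrain)
```

Note that the point needs a label here, because fit reads the label column.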

However, to classify observations with your model, you have to convert the unclassified data into a DataFrame with the same structure as your training data, and then call the model's transform method. By the way, models have that method because they are subclasses of Transformer.
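A minimal sketch of the classification step, again assuming the question's code has run in spark-shell on Spark 1.5.x, so that sqlContext and model exist. The two feature vectors are invented values; only their length (4, the size of the input layer) matters:

```scala
import org.apache.spark.mllib.linalg.Vectors

// Hypothetical unlabeled observations. Only a "features" column is supplied;
// each vector has 4 values to match the network's input layer.
val unlabeled = sqlContext.createDataFrame(Seq(
  Tuple1(Vectors.dense(-0.22, -0.17, 0.42, 0.003)),
  Tuple1(Vectors.dense(0.55, 0.50, -0.25, -0.92))
)).toDF("features")

// transform appends a "prediction" column with the predicted class as a Double.
val classified = model.transform(unlabeled)
classified.select("features", "prediction").show()
```

As far as I can tell, no label column is needed at prediction time; the model only reads the features column.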

Finally, you have to use Double because that is how the LabeledPoint class is defined. It takes a Double as the label and a SparseVector or DenseVector as the features. I don't know the exact motivation, but in my own experience, which isn't wide, all classification and regression algorithms work with floating-point numbers. Furthermore, the gradient descent algorithm, which is widely used to fit most models, uses numbers, not letters or classes, to compute the error in each iteration.
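If your classes are strings rather than numbers, one option (not the only one) is to map them to Double indices with StringIndexer before training. A small sketch under the same spark-shell assumptions; the category names, column names and feature values are hypothetical:

```scala
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.sql.types.DoubleType

// Hypothetical data whose classes are strings instead of numbers.
val raw = sqlContext.createDataFrame(Seq(
  ("a", Vectors.dense(0.1, 0.2, 0.3, 0.4)),
  ("b", Vectors.dense(0.5, 0.6, 0.7, 0.8)),
  ("a", Vectors.dense(0.2, 0.1, 0.4, 0.3))
)).toDF("category", "features")

// StringIndexer assigns each distinct string a Double index and writes it
// to the output column, which can then serve as the "label" column.
val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("label")

val indexed = indexer.fit(raw).transform(raw)
indexed.printSchema()
```

The resulting label column is of type Double, which is what the classifier expects.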

Upvotes: 3
