Ioannis Apomachos

Reputation: 11

Mean Squared Error (MSE) returns a huge number

I'm new to Scala and Spark in general. I'm using this code for regression (based on this example from the Spark official site):

import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.regression.LinearRegressionModel
import org.apache.spark.mllib.regression.LinearRegressionWithSGD
import org.apache.spark.mllib.linalg.Vectors

// Load and parse the data
val data = sc.textFile("Year100")
val parsedData = data.map { line =>
  val parts = line.split(',')
  LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
}.cache()

// Building the model
val numIterations = 100
val stepSize = 0.00000001
val model = LinearRegressionWithSGD.train(parsedData, numIterations, stepSize)

// Evaluate model on training examples and compute training error
val valuesAndPreds = parsedData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}
val MSE = valuesAndPreds.map { case (v, p) => math.pow(v - p, 2) }.mean()
println("training Mean Squared Error = " + MSE)

The dataset that I'm using can be seen here: Pastebin link.

So my question is: why is the MSE 889717.74 (such a huge number)?

Edit: As the commenters suggested, I tried the following:

1) I changed the step size to the default, and the MSE now comes back as NaN.

2) If I try this call: LinearRegressionWithSGD.train(parsedData, numIterations, stepSize, intercept=True), the spark-shell returns an error (error: not found: value True). A sketch of what I suspect the correct call looks like is below.
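
I believe the immediate problem is that Scala booleans are lowercase (true, not True), and as far as I can tell the static train helpers don't take an intercept argument at all. A minimal sketch of setting the intercept through the class-based API instead, assuming Spark 1.x mllib:

// Sketch (assuming Spark 1.x mllib): the intercept is set on a
// LinearRegressionWithSGD instance, not via the static train() helpers
val algorithm = new LinearRegressionWithSGD()
algorithm.setIntercept(true)
algorithm.optimizer
  .setNumIterations(numIterations)
  .setStepSize(stepSize)
val model = algorithm.run(parsedData)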

Upvotes: 1

Views: 1137

Answers (1)

Tim

Reputation: 3725

You've passed a tiny step size and capped the number of iterations at 100. Roughly speaking, the most your parameters can move over the whole run is 0.00000001 * 100 = 0.000001, so the model barely leaves its initial weights. Try using the default step size; I imagine that will fix it.
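A minimal sketch of that change, assuming the same mllib API as in the question (the train overload without a stepSize argument falls back to the default step size of 1.0):

// Sketch: omit stepSize so the default (1.0) is used;
// the evaluation code from the question stays the same
val model = LinearRegressionWithSGD.train(parsedData, numIterations)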

Upvotes: 1
