Spark: Dimensions mismatch when merging with another summarizer

Question

I want to study how additional training data affects model performance (in terms of precision, recall, etc.). I vary the sampling ratio over 0.25, 0.5, 0.75 and 1.0 (from 25% to 100% of all the data).

val sampling_ratio = 0.25

Read cases and controls from separate files.

val negative_training_data: RDD[LabeledPoint] = MLUtils.loadLibSVMFile(spark, "negative_sorted.tsv")
val positive_training_data: RDD[LabeledPoint] = MLUtils.loadLibSVMFile(spark, "positive_sorted.tsv")

Taking a random subset (25% for now) of the dataset for both positive and negative entries.

val negative_split = negative_training_data.randomSplit(Array(sampling_ratio, 1 - sampling_ratio), seed = sample)(0)
val positive_split = positive_training_data.randomSplit(Array(sampling_ratio, 1 - sampling_ratio), seed = sample)(0)

Here is where I combine the two splits to generate the training data.

 val training_data: RDD[LabeledPoint] = negative_split.union(positive_split)

Now train the LogisticRegression model.

val logrmodel = train_LogisticRegression_model(training_data)

Here is the code for model building.

def train_LogisticRegression_model(training: RDD[LabeledPoint]): LogisticRegressionModel = {
  // Run training algorithm to build the model
  // Note: numIterations is declared but never passed to the optimizer
  val numIterations = 100
  val model = new LogisticRegressionWithLBFGS().setNumClasses(2).run(training)
  model
}

However, I get the following error:

Exception in thread "main" org.apache.spark.SparkDriverExecutionException: Execution error
    at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:984)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1390)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
Caused by: java.lang.IllegalArgumentException: requirement failed: Dimensions mismatch when merging with another summarizer. Expecting 4701 but got 4698.
    at scala.Predef$.require(Predef.scala:233)


Answers (1)
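
One likely cause, offered as a sketch rather than a confirmed diagnosis: when no feature count is given, loadLibSVMFile sizes the vectors from the largest feature index present in the file it reads. Two files loaded separately can therefore end up with different dimensions (here 4701 and 4698), and the summarizer fails when it merges statistics across the union. The plain-Scala snippet below does not use Spark; DimensionCheck, maxIndex, inferredDim and the sample lines are all hypothetical and only illustrate how the inferred sizes can diverge.

```scala
// Hypothetical helper, not part of Spark: mimics how a feature dimension
// is inferred from LibSVM text when no explicit size is supplied.
object DimensionCheck {

  // Largest 1-based feature index on one line: "label idx:val idx:val ..."
  def maxIndex(line: String): Int = {
    val indices = line.trim.split("\\s+").drop(1).map(_.split(":")(0).toInt)
    if (indices.isEmpty) 0 else indices.max
  }

  // Inferred dimension of a file = largest index seen on any of its lines.
  def inferredDim(lines: Seq[String]): Int =
    lines.map(maxIndex).foldLeft(0)((a, b) => math.max(a, b))

  def main(args: Array[String]): Unit = {
    // Made-up sample lines; the real files are not shown in the question.
    val fileA = Seq("0 1:0.5 4701:1.0", "0 2:1.0 17:0.3")
    val fileB = Seq("1 3:0.2 4698:1.0", "1 9:0.7")

    val dimA = inferredDim(fileA)
    val dimB = inferredDim(fileB)
    println(s"file A: $dimA features, file B: $dimB features")

    // A single shared size avoids the mismatch once the data are unioned.
    val numFeatures = math.max(dimA, dimB)
    println(s"shared numFeatures: $numFeatures")
  }
}
```

If this is indeed the cause, passing the same explicit size to both loads, via the three-argument overload MLUtils.loadLibSVMFile(sc, path, numFeatures), should give both RDDs the same dimension before the union.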
