Reputation: 5389
I want to study how additional training data affects model performance (in terms of precision, recall, etc.). I vary the sampling ratio over 0.25, 0.5, 0.75 and 1.0 (from 25% to 100% of all the data).
val sampling_ratio = 0.25
Read cases and controls from separate files.
val negative_training_data: RDD[LabeledPoint] = MLUtils.loadLibSVMFile(spark, "negative_sorted.tsv")
val positive_training_data: RDD[LabeledPoint] = MLUtils.loadLibSVMFile(spark, "positive_sorted.tsv")
Taking a random subset (25% for now) of the dataset for both positive and negative entries.
val negative_split = negative_training_data.randomSplit(Array(sampling_ratio, (1 - sampling_ratio)), seed = sample)(0)
val positive_split = positive_training_data.randomSplit(Array(sampling_ratio, (1 - sampling_ratio)), seed = sample)(0)
Here is where I combine the two splits to generate the training data.
val training_data: RDD[LabeledPoint] = negative_split.union(positive_split)
Now train the LogisticRegression model.
val logrmodel = train_LogisticRegression_model(training_data)
Here is the code for model building.
def train_LogisticRegression_model(training: RDD[LabeledPoint]): LogisticRegressionModel = {
  // Run training algorithm to build the model
  val numIterations = 100
  val model = new LogisticRegressionWithLBFGS().setNumClasses(2).run(training)
  model
}
However, I get the following error:
Exception in thread "main" org.apache.spark.SparkDriverExecutionException: Execution error
    at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:984)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1390)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
Caused by: java.lang.IllegalArgumentException: requirement failed: Dimensions mismatch when merging with another summarizer. Expecting 4701 but got 4698.
    at scala.Predef$.require(Predef.scala:233)
Upvotes: 0
Views: 1418
Reputation: 1216
(You have some typos above, and you haven't pasted the code of train_LogisticRegression_model.)
The error tells you that the feature vectors in your positive and negative examples have different sizes (4701 in one set, 4698 in the other). You should check the size of the features as a sanity check on your inputs.
println(negative_training_data.take(3).map(_.features.size).mkString("\n"))
println(positive_training_data.take(3).map(_.features.size).mkString("\n"))
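If the two sizes do differ, a likely cause is that MLUtils.loadLibSVMFile infers the number of features from the largest feature index it finds in each file when you don't specify one, so two separately loaded files can end up with different dimensions. One possible fix is to pass the dimension explicitly through the three-argument overload. Below is a minimal sketch, assuming both files are non-empty; the object name SharedDimensionLoader and the helper loadWithSharedDimension are hypothetical names made up for illustration, not part of Spark.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.rdd.RDD

object SharedDimensionLoader {
  // Hypothetical helper (not a Spark API): loads two LibSVM files so that
  // both RDDs use the same feature dimension. Assumes both files are non-empty.
  def loadWithSharedDimension(
      sc: SparkContext,
      negativePath: String,
      positivePath: String): (RDD[LabeledPoint], RDD[LabeledPoint]) = {
    // First pass: let MLUtils infer each file's dimension independently.
    val negativeDim = MLUtils.loadLibSVMFile(sc, negativePath).first().features.size
    val positiveDim = MLUtils.loadLibSVMFile(sc, positivePath).first().features.size

    // Use the larger of the two so that no feature index falls out of range.
    val numFeatures = math.max(negativeDim, positiveDim)

    // Second pass: reload both files with the shared dimension.
    val negative = MLUtils.loadLibSVMFile(sc, negativePath, numFeatures)
    val positive = MLUtils.loadLibSVMFile(sc, positivePath, numFeatures)
    (negative, positive)
  }
}
```

With the question's files this would be called as SharedDimensionLoader.loadWithSharedDimension(spark, "negative_sorted.tsv", "positive_sorted.tsv"), assuming spark is your SparkContext. If you already know the true dimension of your feature space, you can skip the first pass and pass that number directly.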
Upvotes: 0