Reputation: 275
So I'm trying to build an SVM ensemble application, and I'm having issues with the predict method.
I've based it on the predictByVoting method that the native tree ensembles use. The SVM models are loaded in sequence into an array on the driver, and the weights are calculated and stored in another array called modelWeights.
import scala.collection.mutable
import org.apache.spark.mllib.linalg.Vector

private def predictByVoting(features: Vector): Double = {
  val votes = mutable.Map.empty[Int, Double]
  models.view.zip(modelWeights).foreach { case (svmModel, weight) =>
    // Each model casts a vote for its predicted class, weighted by the model's weight
    val prediction = svmModel.predict(features).toInt
    votes(prediction) = votes.getOrElse(prediction, 0.0) + weight
  }
  // Return the class with the largest accumulated vote weight
  votes.maxBy(_._2)._1
}
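For reference, models and modelWeights are plain driver-side arrays, roughly like this (the paths and the uniform weights are placeholders; SVMModel.load is an assumption about how the models end up in the array):

import org.apache.spark.mllib.classification.SVMModel

// Placeholder paths; stands in for however the models are actually persisted
val modelPaths = Seq("hdfs:///models/svm-0", "hdfs:///models/svm-1")
val models: Array[SVMModel] = modelPaths.map(p => SVMModel.load(sc, p)).toArray
// Uniform weights as a stand-in for the real weight calculation
val modelWeights: Array[Double] = Array.fill(models.length)(1.0)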
This is called on an RDD[LabeledPoint] named test: 4.9 million rows and 11,725,480 features.
val scoresAndLabels = test.map { point =>
  val prediction = predictByVoting(point.features)
  (prediction, point.label)
}
The executors run out of memory when there are too many models. My guess is that this is caused by Spark serializing the models into the task closure and sending them with every task, eventually causing the OOM.
I've tried coalesce to reduce the number of partitions and repartition to increase them, neither helped.
Is this a limitation in the native tree ensembles as well? How do they prevent these out-of-memory errors in their ensemble methods? Am I storing the models and weights in a bad way?
Thanks
Upvotes: 0
Views: 95
Reputation: 275
The issue was that the array of models and weights was serialized into the task closure and sent to the executors with every task, causing the OOM.
I solved this by broadcasting the models and weights:
// Materialize the pairs before broadcasting so executors receive plain data, not a lazy view
val bModelsAndRates = sc.broadcast(models.zip(modelWeights))
This way the models and weights are sent to each executor only once, rather than with every task, limiting the network IO.
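For completeness, here is a minimal sketch of how the predict path changes with the broadcast (same voting logic as in the question, with predictByVoting reworked to take the broadcast pairs explicitly):

import scala.collection.mutable
import org.apache.spark.mllib.classification.SVMModel
import org.apache.spark.mllib.linalg.Vector

def predictByVoting(modelsAndWeights: Seq[(SVMModel, Double)], features: Vector): Double = {
  val votes = mutable.Map.empty[Int, Double]
  modelsAndWeights.foreach { case (svmModel, weight) =>
    val prediction = svmModel.predict(features).toInt
    votes(prediction) = votes.getOrElse(prediction, 0.0) + weight
  }
  votes.maxBy(_._2)._1
}

val scoresAndLabels = test.map { point =>
  // .value reads the broadcast copy already present on the executor
  val prediction = predictByVoting(bModelsAndRates.value, point.features)
  (prediction, point.label)
}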
Upvotes: 2