Reputation: 93
I created a model using H2O's Sparkling Water, and now I'd like to apply it to a huge Spark DataFrame (populated with sparse vectors). I use Python with pyspark and pysparkling. Basically, I need to run a map job with the model.predict() function inside. But copying the data into the H2O context is a huge overhead and not an option. What I think I'm going to do is extract the POJO (Java class) model from the H2O model and use it to map over the DataFrame. My questions are:
Thank you!
Upvotes: 1
Views: 595
Reputation: 437
In this case, you can:
1) Use the h2o.predict(H2OFrame) method to generate predictions, but you need to transform the RDD to an H2OFrame first. It is not a perfect solution; however, in some cases it can be a reasonable one.
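As a rough illustration of option 1, here is a minimal sketch of the round trip through an H2OFrame. The function name score_via_h2o_frame is hypothetical, and the method names asH2OFrame and asSparkFrame are assumed from recent PySparkling releases (older releases spell them as_h2o_frame and as_spark_frame), so check them against your version. The context and model are passed in, so the sketch itself does not start a cluster.

```python
def score_via_h2o_frame(hc, model, spark_df):
    """Score a Spark DataFrame by round-tripping it through an H2OFrame.

    hc    -- an H2OContext (assumed to expose asH2OFrame / asSparkFrame)
    model -- a trained H2O model with a predict(H2OFrame) method
    """
    h2o_frame = hc.asH2OFrame(spark_df)     # copies the data into H2O
    predictions = model.predict(h2o_frame)  # scoring happens inside H2O
    return hc.asSparkFrame(predictions)     # back to a Spark DataFrame
```

The first call is where the copy overhead mentioned in the question comes from, so this only pays off when the DataFrame is small enough to transfer.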
2) Switch to the JVM and call it directly via Spark's Py4J gateway. This is not a fully working solution right now, since the method score0 needs to accept non-primitive types on the H2O side and also needs to be visible (right now it is protected), but here is at least the idea:
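# Fetch the model from H2O's distributed key-value store by its key
model = sc._jvm.water.DKV.getGet("deeplearning.model")
# Allocate a Java double[] with one slot per feature
double_class = sc._jvm.double
row = sc._gateway.new_array(double_class, nfeatures)
row[0] = ...
...
row[nfeatures-1] = ...
# Score the row (needs score0 to be visible, which it currently is not)
prediction = model.score0(row)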
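Since the DataFrame in the question holds sparse vectors, the row-filling step above could be sketched as follows. The helper name fill_dense_row is hypothetical; it only assumes that row supports len() and index assignment, which holds for a Python list and, as far as I know, for a Py4J Java array as well.

```python
def fill_dense_row(row, indices, values):
    """Write a sparse vector into a preallocated array-like `row`."""
    # Reset every slot, since entries absent from the sparse vector are zero
    for i in range(len(row)):
        row[i] = 0.0
    # Copy the non-zero entries; cast to plain int/float because a Py4J
    # array may reject NumPy scalar types
    for index, value in zip(indices, values):
        row[int(index)] = float(value)
    return row


# Stand-in for a Java double[] of length 5, using a Python list
dense = fill_dense_row([None] * 5, [1, 3], [2.5, 4])
```

With a pyspark SparseVector v, the call would be fill_dense_row(row, v.indices, v.values), reusing the same row array between calls.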
I created a JIRA improvement for this case: https://0xdata.atlassian.net/browse/PUBDEV-2726
However, a workaround is to create a Java wrapper around the model that exposes the right shape of the score0 function:
class ModelWrapper extends Model {
  // Public entry point that delegates to the protected score0
  public double[] score(double[] row) {
    return score0(row);
  }
}
Please see also hex.ModelUtils: https://github.com/h2oai/sparkling-water/blob/master/core/src/main/scala/hex/ModelUtils.scala (again, you can call its methods directly via the Py4J gateway exposed by Spark).
Upvotes: 2