Pyspark wrapper for H2O POJO

I created a model using H2O's Sparkling Water, and now I'd like to apply it to a huge Spark DataFrame (populated with sparse vectors). I use Python with PySpark and PySparkling. Basically, I need to run a map job that calls model.predict() inside it. But copying the data into the H2O context is a huge overhead and not an option. What I plan to do is extract the POJO (Java class) from the H2O model and use it in a map over the DataFrame. My questions are:

Is there a better way?
How do I write a PySpark wrapper for a Java class from which I intend to use only one method, .score(double[] data, double[] result)?
How can I maximally reuse the wrappers from the Spark ML library?

Thank you!

Upvotes: 1

Answers (1)

Michal

Reputation: 437

In this case, you can:

1) Use the h2o.predict(H2OFrame) method to generate predictions, but you need to transform the RDD into an H2OFrame first. It is not a perfect solution; however, in some cases it can be a reasonable one.

2) Switch to the JVM and call it directly via Spark's Py4J gateway. This is not a fully working solution right now, since the score0 method would need to accept non-primitive types on the H2O side and also be visible (right now it is protected), but at least it shows the idea:

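# sc is the active SparkContext; look up the model in H2O's DKV by its key
model = sc._jvm.water.DKV.getGet("deeplearning.model")
# Allocate a Java double[] of length nfeatures through the Py4J gateway
double_class = sc._jvm.double
row = sc._gateway.new_array(double_class, nfeatures)
row[0] = ...
...
row[nfeatures-1] = ...
# Does not work yet: score0 is protected on the H2O side
prediction = model.score0(row)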
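
Since the question mentions sparse vectors, the row[0] = ... assignments above could be driven by a small helper that first expands a sparse row into dense values and then copies them into the array. The following is only a minimal sketch: densify and fill_row are hypothetical names, not part of H2O or Spark, and a plain Python list stands in for the Py4J array (which, like a list, supports len() and index assignment).

```python
def densify(size, indices, values):
    """Expand a sparse row (size, indices, values) into a dense list of floats."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = float(v)
    return dense


def fill_row(target, dense):
    """Copy dense values into an index-assignable target of the same length."""
    if len(target) != len(dense):
        raise ValueError("target and row lengths differ")
    for i, v in enumerate(dense):
        target[i] = v
    return target


# Stand-in for sc._gateway.new_array(double_class, nfeatures); assumption:
# on a real driver you would pass the Py4J array here instead of a list.
nfeatures = 5
row = fill_row([0.0] * nfeatures, densify(nfeatures, [1, 3], [2.5, -1.0]))
print(row)  # [0.0, 2.5, 0.0, -1.0, 0.0]
```

With a pyspark.ml.linalg.SparseVector v, the three arguments would presumably come from v.size, v.indices and v.values. Note that every element assignment on a real Py4J array is a separate call into the JVM, so this is better suited to scoring a handful of rows than a whole DataFrame.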

I created a JIRA improvement request for this case: https://0xdata.atlassian.net/browse/PUBDEV-2726

However, a workaround is to create a Java wrapper around the model that exposes the right shape of the score0 function:

class ModelWrapper extends Model {
   public double[] score(double[] row) {
     return score0(row);
   }
}

Please see also hex.ModelUtils: https://github.com/h2oai/sparkling-water/blob/master/core/src/main/scala/hex/ModelUtils.scala (again, you can call these utilities directly via the Py4J gateway exposed by Spark).

Upvotes: 2
