Reputation: 9804
I've got a cluster on AWS where I've installed H2O, Sparkling Water and H2O Flow for Machine Learning purposes on lots of data.
Now, these files come in a JSON format from a streaming job. Let's say they are placed in S3 in a folder called streamed-data.
From Spark, using a SparkSession, I can easily read them all in one go into a DataFrame (this is Python, but that's not important):
spark = SparkSession.builder.getOrCreate()
df = spark.read.json('path/streamed-data')
This reads them all, creates the DataFrame, and is very handy.
Now, I'd like to leverage the capabilities of H2O, hence I've installed it on the cluster, along with the other mentioned software.
Looking at H2O Flow, my problem is the lack of a JSON parser, so I'm wondering whether I can import these files into H2O at all, or whether there's anything I can do to get around the problem.
Upvotes: 1
Views: 780
Reputation: 15141
When running Sparkling Water you can convert an RDD/DataFrame/Dataset to an H2O frame quite easily. Something like this (Scala; Python would look similar) should work:
val dataDF = spark.read.json("path/streamed-data")
val h2oContext = H2OContext.getOrCreate(spark)
import h2oContext.implicits._
val h2oFrame = h2oContext.asH2OFrame(dataDF, "my-frame-name")
From then on you can use the frame from code and/or from the Flow UI.
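For reference, a rough Python (PySparkling) sketch of the same steps might look like the following. The `pysparkling` import path and the frame name are assumptions based on the Sparkling Water distribution you have installed, so adjust them to your version:

```python
from pyspark.sql import SparkSession
from pysparkling import H2OContext  # ships with the PySparkling package (assumption: it is on your PYTHONPATH)

spark = SparkSession.builder.getOrCreate()

# Read the streamed JSON files from S3 into a Spark DataFrame
data_df = spark.read.json('path/streamed-data')

# Start (or attach to) the H2O context on the cluster,
# then convert the DataFrame into an H2O frame
hc = H2OContext.getOrCreate(spark)
h2o_frame = hc.asH2OFrame(data_df, 'my-frame-name')
```

Once the conversion finishes, `my-frame-name` shows up under the frames list in Flow, so you can carry on there with parsing-free access to the data.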
You can find more examples here for Python and here for Scala.
Upvotes: 1