user838681

Reputation: 131

Spark streaming multiple sources, reload dataframe

I have a Spark Streaming context reading event data from Kafka at 10-second intervals. I would like to complement this event data with the existing data in a Postgres table.

I can load the Postgres table with something like:

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
// Load the reference table from Postgres over JDBC
val data = sqlContext.load("jdbc", Map(
  "url" -> url,
  "dbtable" -> query))

...

// Collect the table to the driver and broadcast a snapshot to the executors
val broadcasted = sc.broadcast(data.collect())

And later I can join it with the stream like this:

// Re-create an RDD from the broadcast snapshot and join it with each micro-batch
val db = sc.parallelize(broadcasted.value)
val dataset = stream_data.transform { rdd => rdd.leftOuterJoin(db) }

I would like to keep my current data stream running and still reload this table every 6 hours. Since Apache Spark currently doesn't support multiple running contexts, how can I accomplish this? Is there any workaround? Or will I need to restart the server each time I want to reload the data? This seems like such a simple use case... :/

Upvotes: 13

Views: 1617

Answers (1)

Shawn Guo

Reputation: 3228

In my humble opinion, reloading another data source inside DStream transformations goes against the design and is not recommended.

Compared to traditional stateful stream-processing models, D-Streams are designed to structure a streaming computation as a series of stateless, deterministic batch computations on small time intervals.

The transformations on DStreams are deterministic, and this design enables quick recovery from faults by recomputation. Refreshing external data inside a transformation introduces a side effect that interferes with recovery/recomputation.

One workaround is to postpone the external query to an output operation, for example foreachRDD(func).
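
For example, here is a rough sketch of that approach (only an illustration: the ReferenceData helper, its refresh interval, and the reuse of the url/query values from the question are assumptions, not an existing API):

import org.apache.spark.sql.{Row, SQLContext}

// Hypothetical driver-side cache of the Postgres table, reloaded at most every 6 hours
object ReferenceData {
  private val refreshIntervalMs = 6 * 60 * 60 * 1000L
  private var lastLoaded = 0L
  private var snapshot: Array[Row] = Array.empty

  def get(sqlContext: SQLContext, url: String, query: String): Array[Row] = synchronized {
    val now = System.currentTimeMillis()
    if (snapshot.isEmpty || now - lastLoaded > refreshIntervalMs) {
      // Same JDBC load as in the question, but executed lazily from the driver
      snapshot = sqlContext.load("jdbc", Map("url" -> url, "dbtable" -> query)).collect()
      lastLoaded = now
    }
    snapshot
  }
}

stream_data.foreachRDD { rdd =>
  // foreachRDD runs this body on the driver once per batch interval
  val sqlContext = new SQLContext(rdd.sparkContext)
  val reference = ReferenceData.get(sqlContext, url, query)
  val db = rdd.sparkContext.parallelize(reference)
  // ... key both sides and join here, e.g. rdd.leftOuterJoin(db), then write the output
}

Because the reload happens in an output operation on the driver rather than inside a transformation, the lineage of the DStream transformations stays deterministic and recomputation on failure is unaffected.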

Upvotes: 1
