user838681

Reputation: 131

Spark streaming multiple sources, reload dataframe

I have a Spark Streaming context reading event data from Kafka at 10-second intervals. I would like to complement this event data with the existing data in a Postgres table.

I can load the Postgres table with something like:

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
// Load the reference table from Postgres over JDBC
val data = sqlContext.load("jdbc", Map(
  "url" -> url,
  "dbtable" -> query))

...

// Collect the table to the driver and broadcast a snapshot to the executors
val broadcasted = sc.broadcast(data.collect())

And later I can join it with the stream like this:

// Re-create an RDD from the broadcast snapshot and join it with each micro-batch
val db = sc.parallelize(broadcasted.value)
val dataset = stream_data.transform { rdd => rdd.leftOuterJoin(db) }

I would like to keep my current data stream running and still reload this table every 6 hours. Since Apache Spark currently doesn't support multiple running contexts, how can I accomplish this? Is there any workaround? Or will I need to restart the server each time I want to reload the data? This seems like such a simple use case... :/

Upvotes: 13

Views: 1617

Answers (1)

Shawn Guo

Reputation: 3228

In my humble opinion, reloading another data source inside DStream transformations goes against the design and is not recommended.

Compared to traditional stateful stream-processing models, D-Streams are designed to structure a streaming computation as a series of stateless, deterministic batch computations on small time intervals.

The transformations on DStreams are deterministic, and this design enables quick recovery from faults by recomputation. Refreshing external data inside a transformation introduces a side effect that interferes with recovery/recomputation.

One workaround is to postpone the external query to an output operation, for example foreachRDD(func).
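
For example, here is a rough sketch of that approach (only an illustration: the ReferenceData helper, its refresh interval, and the reuse of the url/query values from the question are assumptions, not an existing API):

import org.apache.spark.sql.{Row, SQLContext}

// Hypothetical driver-side cache of the Postgres table, reloaded at most every 6 hours
object ReferenceData {
  private val refreshIntervalMs = 6 * 60 * 60 * 1000L
  private var lastLoaded = 0L
  private var snapshot: Array[Row] = Array.empty

  def get(sqlContext: SQLContext, url: String, query: String): Array[Row] = synchronized {
    val now = System.currentTimeMillis()
    if (snapshot.isEmpty || now - lastLoaded > refreshIntervalMs) {
      // Same JDBC load as in the question, but executed lazily from the driver
      snapshot = sqlContext.load("jdbc", Map("url" -> url, "dbtable" -> query)).collect()
      lastLoaded = now
    }
    snapshot
  }
}

stream_data.foreachRDD { rdd =>
  // foreachRDD runs this body on the driver once per batch interval
  val sqlContext = new SQLContext(rdd.sparkContext)
  val reference = ReferenceData.get(sqlContext, url, query)
  val db = rdd.sparkContext.parallelize(reference)
  // ... key both sides and join here, e.g. rdd.leftOuterJoin(db), then write the output
}

Because the reload happens in an output operation on the driver rather than inside a transformation, the lineage of the DStream transformations stays deterministic and recomputation on failure is unaffected.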

Upvotes: 1
