vicg

Reputation: 1368

How to have a spark dataframe be constantly updated as writes occur in the db backend?

Basically, I have Spark sitting in front of a database, and I was wondering how I would go about having the DataFrame constantly updated with new data from the backend.

The trivial way I can think of to solve this would be to just re-run the query against the database every couple of minutes, but that is obviously inefficient and would still leave the data stale for the time between updates.

I am not 100% sure whether the database I'm working with has this restriction, but I think rows are only ever added; existing rows are never modified.

Upvotes: 0

Views: 568

Answers (1)

ayan guha

Reputation: 1257

A DataFrame is an RDD plus a schema plus many other features. By Spark's basic design, RDDs are immutable, so you cannot update a DataFrame once it has been materialized. In your case, you can probably mix streaming with SQL, like below:

  1. In your DB, write new rows to a queue alongside the writes to the tables
  2. Use Spark's queue stream to consume from the queue and create DStreams (an RDD every x seconds)
  3. For each incoming RDD, join it with the existing DF to create a new DF (see the sketch after this list)
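
Here is a minimal Scala sketch of steps 2 and 3, under the assumption that whatever writes to the DB also pushes each batch of new rows onto a queue that some producer drains into `rddQueue`. The `(id, value)` row shape, the 10-second batch interval, and the initial `Seq` standing in for the one-off JDBC read are all illustrative assumptions; since rows are append-only, the "join" in step 3 is done here as a simple union.

```scala
import scala.collection.mutable
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}

object QueueStreamSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("queue-stream-sketch")
      .master("local[2]")
      .getOrCreate()
    import spark.implicits._

    val ssc = new StreamingContext(spark.sparkContext, Seconds(10))

    // Step 1 (assumed): some producer pushes each batch of newly written rows
    // into this queue as an RDD of (id, value) pairs.
    val rddQueue = new mutable.Queue[RDD[(Int, String)]]()

    // Stand-in for the one-off initial load of the table (e.g. via spark.read.jdbc).
    var currentDf = Seq((1, "existing row")).toDF("id", "value")

    // Steps 2 and 3: consume the queue as a DStream and fold each batch into the DF.
    ssc.queueStream(rddQueue).foreachRDD { rdd =>
      if (!rdd.isEmpty()) {
        currentDf = currentDf.union(rdd.toDF("id", "value"))
        // Re-register so SQL queries always see the latest data.
        currentDf.createOrReplaceTempView("live_table")
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Note that queue streams are mainly intended for testing; in a real deployment you would more likely consume from Kafka or another receiver, but the union-then-re-register pattern stays the same.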

Upvotes: 1
