vicg

Reputation: 1368

How to have a spark dataframe be constantly updated as writes occur in the db backend?

Basically, I have Spark sitting in front of a database, and I was wondering how I would go about having the DataFrame constantly updated with new data from the backend.

The trivial way I can think of to solve this would be to just re-run the query against the database every couple of minutes, but that is obviously inefficient and would still leave the data stale for the time between updates.

I am not 100% sure whether the database I'm working with has this restriction, but I think rows are only ever added; existing rows are never modified.

Upvotes: 0

Views: 568

Answers (1)

ayan guha

Reputation: 1257

A DataFrame is an RDD plus a schema plus many other features. By Spark's basic design, RDDs are immutable, so you cannot update a DataFrame once it has been materialized. In your case, you can probably mix streaming with SQL, like below:

  1. In your DB, write new rows to a queue alongside the writes to the tables
  2. Use Spark's queue stream to consume from the queue and create DStreams (an RDD every x seconds)
  3. For each incoming RDD, join it with the existing DF to create a new DF (see the sketch after this list)
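
Here is a minimal Scala sketch of steps 2 and 3, under the assumption that whatever writes to the DB also pushes each batch of new rows onto a queue that some producer drains into `rddQueue`. The `(id, value)` row shape, the 10-second batch interval, and the initial `Seq` standing in for the one-off JDBC read are all illustrative assumptions; since rows are append-only, the "join" in step 3 is done here as a simple union.

```scala
import scala.collection.mutable
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}

object QueueStreamSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("queue-stream-sketch")
      .master("local[2]")
      .getOrCreate()
    import spark.implicits._

    val ssc = new StreamingContext(spark.sparkContext, Seconds(10))

    // Step 1 (assumed): some producer pushes each batch of newly written rows
    // into this queue as an RDD of (id, value) pairs.
    val rddQueue = new mutable.Queue[RDD[(Int, String)]]()

    // Stand-in for the one-off initial load of the table (e.g. via spark.read.jdbc).
    var currentDf = Seq((1, "existing row")).toDF("id", "value")

    // Steps 2 and 3: consume the queue as a DStream and fold each batch into the DF.
    ssc.queueStream(rddQueue).foreachRDD { rdd =>
      if (!rdd.isEmpty()) {
        currentDf = currentDf.union(rdd.toDF("id", "value"))
        // Re-register so SQL queries always see the latest data.
        currentDf.createOrReplaceTempView("live_table")
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Note that queue streams are mainly intended for testing; in a real deployment you would more likely consume from Kafka or another receiver, but the union-then-re-register pattern stays the same.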

Upvotes: 1
