Reputation: 642
I have a use case where I am joining a streaming DataFrame with a static DataFrame. The static DataFrame is read from a parquet table (a directory containing parquet files).
This parquet data is updated by another process once a day.
My question is what would happen to my static DataFrame?
Would it update itself because of the lazy execution or is there some weird caching behavior that can prevent this?
Can the update process make my code crash?
Would it be possible to force the DataFrame to update itself once a day in any way?
I don't have any code to share for this because I haven't written any yet; I am just exploring the possibilities. I am working with Spark 2.3.2.
Upvotes: 1
Views: 842
Reputation: 18013
A big (set of) question(s).
I have not implemented all aspects of this myself (yet), but this is my understanding, plus input from colleagues who implemented one aspect in a way I found compelling and logical. I note that there is not enough information out there on this topic.
So, if you have a JOIN (streaming --> static), then:
If standard coding practices as per Databricks are applied and .cache is used, the Spark Structured Streaming program will read the static source only once; no changes are seen on subsequent processing cycles and there is no program failure.
If standard coding practices as per Databricks are applied and caching is NOT used, the Spark Structured Streaming program will read the static source on every loop, and all changes will be seen on subsequent processing cycles thereafter. A minimal sketch of both cases follows.
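Here is a rough sketch of that stream-static join, per the behaviour described above. The parquet path /data/lookup, the rate source, and the join key "id" are all hypothetical placeholders, not anything from the question:

```scala
import org.apache.spark.sql.SparkSession

object StreamStaticJoin {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("StreamStaticJoin").getOrCreate()

    // Static side: hypothetical parquet directory refreshed daily by another job.
    val staticDf = spark.read.parquet("/data/lookup")
    // staticDf.cache()  // uncomment to pin the first snapshot: the daily
    //                   // refresh would then NOT be picked up.

    // Streaming side: a rate source just for illustration.
    val streamDf = spark.readStream.format("rate").load()

    // Uncached, the static side is re-resolved each micro-batch, so the
    // daily refresh shows up on the next processing cycle.
    val joined = streamDf.join(staticDf, streamDf("value") === staticDf("id"))

    joined.writeStream.format("console").start().awaitTermination()
  }
}
```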
But JOINing against LARGE static sources is not a good idea. If a large dataset is evident, use HBase or some other key-value store, with mapPartitions, whether the data is volatile or non-volatile. This is more difficult, though. It was done by an airline company I worked at, and the data engineer/designer told me it was no easy task. Indeed, it is not that easy.
- So, we can say that updates to the static source will not cause any crash.
If your data is small enough, an alternative is to perform the lookup via a JOIN on a compound primary key: the original key augmented with a technical version column, where the lookup always selects the max version. The background process then writes each refresh as a new set of data rather than overwriting the old one. This is easiest in my view if you know the data is volatile and small. Versioning also means others may still read the older data; I state this because the source may be shared.
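A minimal sketch of that versioned-lookup idea, assuming a hypothetical technical column "version" of long type and a path /data/lookup_versioned:

```scala
import org.apache.spark.sql.functions.{col, max}

// Hypothetical layout: the refresh job appends rows tagged with a new value
// in the technical "version" column instead of overwriting files in place.
val lookup = spark.read.parquet("/data/lookup_versioned")

// Keep only the rows of the latest version, so (id, version) acts as a
// compound primary key and older versions remain readable by other consumers.
val latestVersion = lookup.agg(max(col("version"))).head().getLong(0)
val currentLookup = lookup.filter(col("version") === latestVersion)
// ... then join the stream against currentLookup as above.
```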
The final say for me is that I would NOT want to JOIN with the latest info if the static source is large - e.g. some Chinese companies have 100M customers! In this case I would use a KV store as the lookup, with mapPartitions, as opposed to a JOIN. See https://medium.com/@anchitsharma1994/hbase-lookup-in-spark-streaming-acafe28cb0dc, which provides some insights. Also, this is old but still an applicable source of information: https://blog.codecentric.de/en/2017/07/lookup-additional-data-in-spark-streaming/. Both are good reads, but they require some experience to see the forest for the trees. A sketch of the pattern follows.
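Roughly, the mapPartitions + HBase lookup pattern looks like this. The table name "customers", column family "cf", qualifier "segment", and the customerIds Dataset are all made-up assumptions for illustration:

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get}
import org.apache.hadoop.hbase.util.Bytes

import spark.implicits._  // encoder for the (String, String) result tuples

// customerIds: Dataset[String] of keys to enrich (hypothetical).
val enriched = customerIds.mapPartitions { ids =>
  // One HBase connection per partition, never per record.
  val conn  = ConnectionFactory.createConnection(HBaseConfiguration.create())
  val table = conn.getTable(TableName.valueOf("customers"))
  // Materialise the results before closing, since iterators are lazy.
  val out = ids.map { id =>
    val result  = table.get(new Get(Bytes.toBytes(id)))
    val segment = Bytes.toString(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("segment")))
    (id, segment)
  }.toList
  table.close(); conn.close()
  out.iterator
}
```

Because each record hits the store at lookup time, the daily refresh is always visible without re-reading any static DataFrame.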
Upvotes: 1