tesnik03

Reputation: 1359

Spark Streaming with aggregation

I am trying to understand Spark Streaming in terms of aggregation principles. Spark DFs are based on mini-batches, and computations are done on the mini-batch that arrived within a specific time window.

Let's say we have data coming in as -

    Window_period_1[Data1, Data2, Data3]
    Window_period_2[Data4, Data5, Data6]
    ...

then the first computation will be done for Window_period_1 and then for Window_period_2. If I need to use the new incoming data along with historic data, say a kind of groupBy between Window_period_new and the data from Window_period_1 and Window_period_2, how would I do that?
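Conceptually, I'm after something like this (the frames below are hypothetical stand-ins for each window period's data, assuming Spark 2.x where union replaces unionAll):

    // dfWindow1, dfWindow2, dfWindowNew are hypothetical DataFrames,
    // one per window period above
    val historic = dfWindow1.union(dfWindow2)

    // aggregate the new window's data together with everything seen so far
    val combined = dfWindowNew.union(historic)
    val result   = combined.groupBy("key").count()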

Another way of seeing the same thing: let's say I have a requirement where a few data frames are already created -

df1, df2, df3, and I need to run an aggregation that will involve data from df1, df2, df3 as well as Window_period_1, Window_period_2, and all new incoming streaming data.

how would I do that?
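In that case (again with hypothetical names), I would want the effect of:

    // static frames created earlier, plus everything streamed in so far
    // (allWindowsSoFar is a hypothetical frame accumulating the stream)
    val static   = df1.union(df2).union(df3)
    val combined = static.union(allWindowsSoFar)
    val agg      = combined.groupBy("key").count()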

Upvotes: 5

Views: 874

Answers (1)

Natalia

Reputation: 4532

Spark allows you to store state in an RDD (with checkpoints). So, even after a restart, the job will restore its state from the checkpoint and continue streaming.
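For example, with the DStream API, updateStateByKey keeps a running per-key aggregate across batches, backed by the checkpoint. A minimal sketch (the socket source, port, batch interval, and checkpoint path are placeholders):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("StatefulAgg")
    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint("hdfs:///tmp/checkpoint-dir") // required for stateful ops

    val pairs = ssc.socketTextStream("localhost", 9999)
      .flatMap(_.split(" "))
      .map(word => (word, 1L))

    // merge each batch's values into the running total restored from state
    val updateTotals = (newValues: Seq[Long], state: Option[Long]) =>
      Some(newValues.sum + state.getOrElse(0L))

    val runningCounts = pairs.updateStateByKey(updateTotals)
    runningCounts.print()

    ssc.start()
    ssc.awaitTermination()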

However, we faced performance problems with checkpointing (especially after restoring state), so it is worth implementing state storage in some external source (like HBase).
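A rough sketch of pushing the per-batch state out to HBase instead (the "state" table, "cf" column family, and "total" qualifier are made-up names; connection handling is simplified):

    import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
    import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
    import org.apache.hadoop.hbase.util.Bytes
    import org.apache.spark.streaming.dstream.DStream

    def saveStateToHBase(state: DStream[(String, Long)]): Unit =
      state.foreachRDD { rdd =>
        rdd.foreachPartition { partition =>
          // one connection per partition, opened on the executor
          val connection = ConnectionFactory.createConnection(HBaseConfiguration.create())
          val table = connection.getTable(TableName.valueOf("state"))
          partition.foreach { case (key, total) =>
            val put = new Put(Bytes.toBytes(key))
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("total"), Bytes.toBytes(total))
            table.put(put)
          }
          table.close()
          connection.close()
        }
      }

You would then call it on the stateful stream, e.g. saveStateToHBase(runningCounts) from the sketch above.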

Upvotes: 2
