tesnik03

Reputation: 1359

Spark Streaming with aggregation

I am trying to understand Spark Streaming in terms of aggregation principles. Spark DFs are based on mini-batches, and computations are done on the mini-batch that arrived within a specific time window.

Let's say we have data coming in as -

    Window_period_1[Data1, Data2, Data3]
    Window_period_2[Data4, Data5, Data6]
    ...

then the first computation will be done for Window_period_1 and then for Window_period_2. If I need to use the new incoming data along with historic data, say a kind of groupBy between Window_period_new and the data from Window_period_1 and Window_period_2, how would I do that?
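Conceptually, I'm after something like this (the frames below are hypothetical stand-ins for each window period's data, assuming Spark 2.x where union replaces unionAll):

    // dfWindow1, dfWindow2, dfWindowNew are hypothetical DataFrames,
    // one per window period above
    val historic = dfWindow1.union(dfWindow2)

    // aggregate the new window's data together with everything seen so far
    val combined = dfWindowNew.union(historic)
    val result   = combined.groupBy("key").count()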

Another way of seeing the same thing: let's say I have a requirement where a few data frames are already created -

df1, df2, df3, and I need to run an aggregation that will involve data from df1, df2, df3 as well as Window_period_1, Window_period_2, and all new incoming streaming data.

how would I do that?
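In that case (again with hypothetical names), I would want the effect of:

    // static frames created earlier, plus everything streamed in so far
    // (allWindowsSoFar is a hypothetical frame accumulating the stream)
    val static   = df1.union(df2).union(df3)
    val combined = static.union(allWindowsSoFar)
    val agg      = combined.groupBy("key").count()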

Upvotes: 5

Views: 874

Answers (1)

Natalia

Reputation: 4532

Spark allows you to store state in an RDD (with checkpoints). So, even after a restart, the job will restore its state from the checkpoint and continue streaming.
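For example, with the DStream API, updateStateByKey keeps a running per-key aggregate across batches, backed by the checkpoint. A minimal sketch (the socket source, port, batch interval, and checkpoint path are placeholders):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("StatefulAgg")
    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint("hdfs:///tmp/checkpoint-dir") // required for stateful ops

    val pairs = ssc.socketTextStream("localhost", 9999)
      .flatMap(_.split(" "))
      .map(word => (word, 1L))

    // merge each batch's values into the running total restored from state
    val updateTotals = (newValues: Seq[Long], state: Option[Long]) =>
      Some(newValues.sum + state.getOrElse(0L))

    val runningCounts = pairs.updateStateByKey(updateTotals)
    runningCounts.print()

    ssc.start()
    ssc.awaitTermination()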

However, we faced performance problems with checkpointing (especially after restoring state), so it is worth implementing state storage in some external source (like HBase).
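A rough sketch of pushing the per-batch state out to HBase instead (the "state" table, "cf" column family, and "total" qualifier are made-up names; connection handling is simplified):

    import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
    import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
    import org.apache.hadoop.hbase.util.Bytes
    import org.apache.spark.streaming.dstream.DStream

    def saveStateToHBase(state: DStream[(String, Long)]): Unit =
      state.foreachRDD { rdd =>
        rdd.foreachPartition { partition =>
          // one connection per partition, opened on the executor
          val connection = ConnectionFactory.createConnection(HBaseConfiguration.create())
          val table = connection.getTable(TableName.valueOf("state"))
          partition.foreach { case (key, total) =>
            val put = new Put(Bytes.toBytes(key))
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("total"), Bytes.toBytes(total))
            table.put(put)
          }
          table.close()
          connection.close()
        }
      }

You would then call it on the stateful stream, e.g. saveStateToHBase(runningCounts) from the sketch above.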

Upvotes: 2
