Reputation: 11
we have 3 different data sources which eventually we need to do some kind of inner join between them we create all pcollections with group by key pcollectionA - Implemented using state (the data is not changed) pcollectionB - windowed for 5h. if another event arrives within that time we would like to increase the window in another 5h hours. Implemented using custom window pcollectionC - windowed for 30 minutes. Implemented using fixed window.
our purpose is to send pcollectionC events only if the relevant key exist in pcollectionA & pcollectionB
what is the best way to implement it as regular joins are not working due do the window differences
Upvotes: 0
Views: 89
Reputation: 2835
From your question I am assuming that the first two streams (pcollectionA and pcollectionB) are more like control/reference data used by the stream processor of pcollectionC.
Have you tried the Broadcast State Pattern? You can convert pcollectionA and pcollectionB to Broadcast streams and then look them up in the processor of pcollectionC.
Here is a very high level design:
// broadcast pcollectionA
BroadcastStream<> A_BroadcastStream = pcollectionA_stream
.broadcast(stateDescriptor);
// broadcast pcollectionB
BroadcastStream<> B_BroadcastStream = pcollectionB_stream
.broadcast(stateDescriptor);
//Process pcollectionC with broadcast data from A and B.
DataStream<String> output = pcollectionC_Stream.
.connect(A_BroadcastStream)
.process(
// process/join with pcollectionA items
)
.connect(B_BroadcastStream)
.process(
// process/join with pcollectionB items
)
This documentation has a great example for this design pattern - https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/fault-tolerance/broadcast_state/
Upvotes: 1