big
big

Reputation: 1938

storm stream joins with parallelism

I have two spouts and both are emitting some data

Spout A tuple-> pid, data1, data2, data3
Spout B tuple -> pid, m1, m2

I want join the data from above two spouts in bolt over "pid"

Spout A
   |------------------> joinBolt ----> pid, data1, data2, data3, m1, m2
Spout B

JoinBolt will combine the data over "pid" and will emit tuple (pid, data1, data2, data3, m1, m2)

JoinBolt joinBolt = new JoinBolt()
BoltDeclarer bd = builder.setBolt("joinBoltId", joinBolt, 5); 
bd.fieldsGrouping("spout1Id" "stream1",  new Fields("pid"));
bd.fieldsGrouping("spout2Id", "stream2", new Fields("pid"));

If I have parallelism of 5 in JoinBolt, can I be sure that data from both spouts having same pid will land up on same instance of joinBolt.

In this case, since I have parallelism of 5, I will be having 5 instance of joinBolt (say b1, b2, b3, b4, b5). Now is it possible that pid1 from spout1 and pid1 from spout2 can go to different instance of joinBolt even if I have set fieldsGrouping on pid?

Upvotes: 0

Views: 90

Answers (1)

If you are using fieldsGrouping on pid, for same values pid it will go to the same instance of JoinBolt. FYI, Storm added JoinBolt functionality https://github.com/apache/storm/blob/master/storm-core/src/jvm/org/apache/storm/bolt/JoinBolt.java based on windowing.

Upvotes: 1

Related Questions