Reputation: 191
When building a Kafka Streams topology, reads from multiple topics can be modeled in two different ways:
// Option 1: a single source node subscribed to all three topics
topologyBuilder.addSource("sourceName", ..., "topic1", "topic2", "topic3");
// Option 2: one source node per topic
topologyBuilder.addSource("sourceName1", ..., "topic1")
    .addSource("sourceName2", ..., "topic2")
    .addSource("sourceName3", ..., "topic3");
Is there a relative advantage of option 1 (one source node for all topics) over option 2 (one source node per topic), or vice versa? All topics contain the same type of data and have the same data processing logic.
Upvotes: 0
Views: 5830
Reputation: 775
There are several other factors to consider.
If your input data is uniformly distributed across the input topics (in both message size and message rate), then go for option 1 because of its simplicity. If not, the "slow" topics will slow down your overall consumption, so to achieve shorter delays on the "fast" topics, go for option 2.
If you run several such topologies in parallel on different nodes (for high availability or high throughput), then having one consumer group (option 1) means more consumers to coordinate within it. In my experience this also slows down consumption, especially when consumers restart or drop out. In this case I also go for option 2: fewer consumers in a group take less effort to coordinate, which means shorter delays.
Upvotes: 2
Reputation: 3242
Given that, as you state, all input topics contain the same kind of data and subsequent processing of the data is equivalent, you should most probably go with option 1, for the following two reasons:
1) this will result in a smaller topology
2) you would only need to connect one source node to your subsequent processing steps
If processing later needs to differ between the source topics, you can then split the source node into multiple ones.
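To make the two reasons above concrete, here is a minimal sketch that counts nodes and edges for both layouts. `MiniTopology` is a hypothetical stand-in written only for this illustration. It is not the Kafka Streams API; it merely mirrors the general shape of adding a source with a list of topics and then attaching a processor by naming its parent nodes. The node and topic names are taken from the question, and `"process"` is an assumed name for the shared processing step.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a topology builder; NOT the Kafka Streams API.
// It only records node names and parent links so the two layouts can be compared.
class MiniTopology {
    // node name -> names of its parent nodes (empty list for source nodes)
    private final Map<String, List<String>> parents = new LinkedHashMap<>();

    // One source node, subscribed to any number of topics.
    MiniTopology addSource(String name, String... topics) {
        parents.put(name, new ArrayList<>());
        return this;
    }

    // One processor node, connected to every parent node that is named.
    MiniTopology addProcessor(String name, String... parentNames) {
        for (String parent : parentNames) {
            if (!parents.containsKey(parent)) {
                throw new IllegalArgumentException("Unknown parent node: " + parent);
            }
        }
        parents.put(name, new ArrayList<>(Arrays.asList(parentNames)));
        return this;
    }

    int nodeCount() {
        return parents.size();
    }

    int edgeCount() {
        int edges = 0;
        for (List<String> p : parents.values()) {
            edges += p.size();
        }
        return edges;
    }

    // Option 1: a single source node reads all three topics.
    static MiniTopology optionOne() {
        return new MiniTopology()
            .addSource("sourceName", "topic1", "topic2", "topic3")
            .addProcessor("process", "sourceName");
    }

    // Option 2: one source node per topic, each wired to the same processor.
    static MiniTopology optionTwo() {
        return new MiniTopology()
            .addSource("sourceName1", "topic1")
            .addSource("sourceName2", "topic2")
            .addSource("sourceName3", "topic3")
            .addProcessor("process", "sourceName1", "sourceName2", "sourceName3");
    }

    public static void main(String[] args) {
        MiniTopology one = optionOne();
        MiniTopology two = optionTwo();
        System.out.println("option 1: " + one.nodeCount() + " nodes, " + one.edgeCount() + " edges");
        System.out.println("option 2: " + two.nodeCount() + " nodes, " + two.edgeCount() + " edges");
    }
}
```

In this model, option 1 yields two nodes and one edge, while option 2 yields four nodes and three edges, and every further processing step attached directly to the sources has to list all three parent names.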
Upvotes: 2