Reputation: 150
Repartitioning a high-volume topic in Kafka Streams could be very expensive. One solution is to partition the topic by a key on the producer’s side and ingest an already partitioned topic in the Streams app.
Is there a way to tell Kafka Streams DSL that my source topic is already partitioned by the given key and no repartition is needed?
Let me clarify my question. Suppose I have a simple aggregation like this (details omitted for brevity):
builder
.stream("messages")
.groupBy((key, msg) -> msg.field)
.count();
Given this code, Kafka Streams would read the messages topic and immediately write the messages back to an internal repartitioning topic, this time partitioned by msg.field as the key.
One simple way to make this round trip unnecessary is to write the original messages topic partitioned by msg.field in the first place. But Kafka Streams knows nothing about how the messages topic is partitioned, and I've found no way to tell it without causing a real repartition.
Note that I'm not trying to eliminate the partitioning step completely as the topic has to be partitioned to compute keyed aggregations. I just want to shift the partitioning step upstream from the Kafka Streams application to the original topic producers.
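To make the "partition upstream" idea concrete, here is a minimal sketch of what a producer-side partitioner does: it maps each key to a partition deterministically, so all messages with the same msg.field value land in the same partition. The class name, partitionFor, and the use of String.hashCode() are assumptions made for illustration; as far as I know, Kafka's default partitioner hashes the serialized key bytes with murmur2 instead, so the actual partition numbers would differ.

```java
public class UpstreamPartitioning {

    // Hypothetical stand-in for a producer-side partitioner: maps a key to a
    // partition deterministically. String.hashCode() is an assumption made for
    // brevity; it is not the hash Kafka's default partitioner uses.
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        int numPartitions = 4; // assumed partition count of the messages topic
        String[] fieldValues = {"a", "b", "a", "c", "b", "a"};
        for (String field : fieldValues) {
            // Equal field values always print the same partition number.
            System.out.println(field + " -> partition " + partitionFor(field, numPartitions));
        }
    }
}
```

The property that matters is only that equal keys always map to the same partition; any deterministic hash gives that.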
What I'm looking for is basically something like this:
builder
.stream("messages")
.assumeGroupedBy((key, msg) -> msg.field)
.count();
where assumeGroupedBy would mark the stream as already partitioned by msg.field. I understand this solution is somewhat fragile and would break on a partitioning key mismatch, but it solves one of the problems when processing really large volumes of data.
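The fragility mentioned above can be illustrated with a small plain-Java simulation (no Kafka involved). It counts by msg.field separately inside each partition, the way per-partition tasks would if no repartitioning happened, and then checks whether any field value ended up counted in more than one partition. All names here (Msg, partitionFor, countPerPartition, eachFieldInOnePartition) are hypothetical, and String.hashCode() is only a stand-in for a real partitioner's hash.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class PartitionMismatchSketch {

    // Hypothetical message: the record key used for partitioning plus the
    // field we actually want to aggregate on.
    static final class Msg {
        final String key;
        final String field;
        Msg(String key, String field) { this.key = key; this.field = field; }
    }

    // Stand-in partitioner (assumption: not Kafka's real hash function).
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    // Counts by field independently within each partition, with no shuffle.
    static List<Map<String, Long>> countPerPartition(List<Msg> msgs, int numPartitions) {
        List<Map<String, Long>> counts = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            counts.add(new HashMap<>());
        }
        for (Msg m : msgs) {
            counts.get(partitionFor(m.key, numPartitions)).merge(m.field, 1L, Long::sum);
        }
        return counts;
    }

    // True if every field value was counted in exactly one partition,
    // i.e. the per-partition counts are already the final counts.
    static boolean eachFieldInOnePartition(List<Map<String, Long>> counts) {
        Set<String> seen = new HashSet<>();
        for (Map<String, Long> partitionCounts : counts) {
            for (String field : partitionCounts.keySet()) {
                if (!seen.add(field)) {
                    return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<Msg> matching = List.of(new Msg("a", "a"), new Msg("a", "a"), new Msg("b", "b"));
        List<Msg> mismatched = List.of(new Msg("k1", "a"), new Msg("k2", "a"));
        System.out.println(eachFieldInOnePartition(countPerPartition(matching, 2)));
        System.out.println(eachFieldInOnePartition(countPerPartition(mismatched, 2)));
    }
}
```

When the record key equals msg.field, each field value stays in one partition and the local counts are complete. When the key differs, the same field value can be split across partitions, and each partition holds only a partial count, with nothing signalling the error.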
Upvotes: 1
Views: 864
Reputation: 15087
Update after question was updated: If your data is already partitioned as needed, and you simply want to aggregate the data without incurring a repartitioning operation (both are true for your use case), then all you need is to use groupByKey() instead of groupBy(). Whereas groupBy() always results in repartitioning, its sibling groupByKey() assumes that the input data is already partitioned as needed as per the existing message key. In your example, groupByKey() would work if key == msg.field.
Original answer below:
Repartitioning a high-volume topic in Kafka Streams could be very expensive.
Yes, that's right: it could be very expensive (e.g., when high volume means millions of events per second).
Is there a way to tell Kafka Streams DSL that my source topic is already partitioned by the given key and no repartition is needed?
Kafka Streams does not repartition the data unless you instruct it to, e.g., with the KStream#groupBy() function. Hence there is no need to tell it "not to partition" as you say in your question.
One solution is to partition the topic by a key on the producer’s side and ingest an already partitioned topic in Streams app.
Given this workaround idea of yours, my impression is that your motivation for asking is something else (you must have a specific situation in mind), but your question text does not make it clear what that could be. Perhaps you need to update your question with more details?
Upvotes: 2