Reputation: 501
I have a scenario where a topic_source contains all of the messages generated by another application. These JSON messages might be duplicated, so I need to deduplicate them based on a "window" size: say, for every 10 seconds, if there are duplicates from topic_source, I send only the deduplicated messages (deduplicated by message_id) to topic_target.
For this I am using a KStream: I read from topic_source, group by message_id with a windowed "count" aggregation, and for each entry I send one message to topic_target.
Something like the below:
final KStream<String, Output> msgs = builder.stream("topic_source",
        Consumed.with(Serdes.String(), new JsonSerde<>(Output.class)));

final KTable<Windowed<String>, Long> counts = msgs
        .groupByKey()
        .windowedBy(TimeWindows.of(Duration.ofSeconds(10)))
        .count();

counts.toStream()
        .map((key, value) -> KeyValue.pair(key.key(), new Output(key.key(), value, key.window().start())))
        .to("topic_target", Produced.with(Serdes.String(), new JsonSerde<>(Output.class)));
This works fine on my local machine (standalone, Eclipse IDE on Windows) when tested.
But when I deploy the service/application on Kubernetes pods and test it, I find that topic_target receives as many messages as topic_source, i.e. no deduplication is happening.
I think the topic_source messages are processed on different pods, and the aggregation across pods does not result in a single group-by (message_id) set, i.e. each pod groups by message_id independently and sends its own deduplicated messages to topic_target, so the accumulated result still contains duplicates.
Is there any way to solve this issue on a Kubernetes cluster? I.e. is there any way for all pods together to group by on one set, and send one distinct/deduplicated message set to topic_target?
To achieve this, what features of Kubernetes/Docker should I use? Is there a design mechanism/pattern I should follow?
Any advice is highly appreciated.
Upvotes: 0
Views: 52
Reputation: 176
Who processes which messages depends on your partition assignment. Even if you have multiple pods, Kafka Streams will allocate the same partitions to the same pods. So pod 1 will have partition 1 of input_topic and partition 1 of whatever other topic your application is consuming.
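In particular, deduplication across pods only works if all records with the same message_id land in the same partition, which means message_id has to be the record key. A minimal sketch of re-keying before the windowed count, assuming your Output class exposes a getMessageId() accessor (that accessor is an assumption, not something from your code):
// Hypothetical re-keying: selectKey() triggers a repartition, so all records
// with the same message_id end up in the same partition and therefore on the
// same pod before the windowed count runs.
final KStream<String, Output> byMessageId = msgs
        .selectKey((key, value) -> value.getMessageId());

final KTable<Windowed<String>, Long> counts = byMessageId
        .groupByKey(Grouped.with(Serdes.String(), new JsonSerde<>(Output.class)))
        .windowedBy(TimeWindows.of(Duration.ofSeconds(10)))
        .count();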
Given the specificity of your needs, which would be possible to implement with standard operators, I'd probably implement this with the Processor API. It requires an extra changelog topic, versus the repartition you'll need for grouping by key.
The processor code would look something like this:
import lombok.AllArgsConstructor;
import lombok.Data;
import lombok.NoArgsConstructor;
import org.apache.kafka.streams.processor.api.Processor;
import org.apache.kafka.streams.processor.api.ProcessorContext;
import org.apache.kafka.streams.processor.api.Record;
import org.apache.kafka.streams.state.KeyValueStore;

public class DeduplicationTimedProcessor<Key, Value> implements Processor<Key, Value, Key, Value> {

    private final String storeName;
    private final long deduplicationOffset;

    private ProcessorContext<Key, Value> context;
    private KeyValueStore<Key, TimestampedValue<Value>> deduplicationStore;

    @Data
    @NoArgsConstructor
    @AllArgsConstructor
    public static class TimestampedValue<Value> {
        private long timestamp;
        private Value value;
    }

    // Store needed for deduplication - means one changelog topic
    public DeduplicationTimedProcessor(String storeName, long deduplicationOffset) {
        this.storeName = storeName;
        this.deduplicationOffset = deduplicationOffset;
    }

    @Override
    public void init(ProcessorContext<Key, Value> context) {
        this.context = context;
        this.deduplicationStore = context.getStateStore(storeName);
    }

    @Override
    public void process(Record<Key, Value> record) {
        var key = record.key();
        var value = record.value();
        var timestamp = context.currentSystemTimeMs(); // Uses System.currentTimeMillis() by default but easier for testing

        var previousValue = deduplicationStore.get(key);

        // New key - nothing to deduplicate against - store + forward
        if (previousValue == null) {
            deduplicationStore.put(key, new TimestampedValue<>(timestamp, value));
            context.forward(new Record<>(key, value, timestamp));
            return;
        }

        // Previous value exists - drop the record if it is a duplicate within the window
        if (previousValue.getValue().equals(value) && timestamp - previousValue.getTimestamp() < deduplicationOffset) {
            // Skip this message as a duplicate within the window
            return;
        }

        deduplicationStore.put(key, new TimestampedValue<>(timestamp, value));
        context.forward(new Record<>(key, value, timestamp));
    }
}
Added a few comments for clarity in there.
Please be mindful that cleanup of the store rests with you, otherwise at some point you'll run out of disk space. Since you mentioned that your operation is for analytics, I'd probably utilize a punctuator to routinely clean up everything that is appropriately "old".
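A minimal sketch of such a punctuator, to be registered at the end of init() above; the five-minute interval and the wall-clock punctuation type are assumptions, and it needs imports for Duration, PunctuationType, KeyValue and KeyValueIterator:
// Hypothetical cleanup punctuator: scan the store periodically and delete
// entries that are older than the deduplication window.
context.schedule(Duration.ofMinutes(5), PunctuationType.WALL_CLOCK_TIME, now -> {
    try (KeyValueIterator<Key, TimestampedValue<Value>> it = deduplicationStore.all()) {
        while (it.hasNext()) {
            KeyValue<Key, TimestampedValue<Value>> entry = it.next();
            if (now - entry.value.getTimestamp() > deduplicationOffset) {
                deduplicationStore.delete(entry.key);
            }
        }
    }
});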
To use the processor, use the process method (in older versions of Kafka Streams, transform).
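A minimal sketch of that wiring, assuming Kafka Streams 3.3+ (where process() returns a KStream), the JsonSerde from the question, and illustrative names such as "dedup-store":
StreamsBuilder builder = new StreamsBuilder();

// State store backing the processor - this is the extra changelog topic.
builder.addStateStore(Stores.keyValueStoreBuilder(
        Stores.persistentKeyValueStore("dedup-store"),
        Serdes.String(),
        new JsonSerde<>(DeduplicationTimedProcessor.TimestampedValue.class)));

builder.stream("topic_source", Consumed.with(Serdes.String(), new JsonSerde<>(Output.class)))
        // Assumes message_id is (or has been made) the record key, see the re-keying sketch above.
        .process(() -> new DeduplicationTimedProcessor<String, Output>("dedup-store", Duration.ofSeconds(10).toMillis()),
                 "dedup-store")
        .to("topic_target", Produced.with(Serdes.String(), new JsonSerde<>(Output.class)));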
Upvotes: 2
Reputation: 372
There are two things that jump to mind:
Are you setting the application.id properly for all of the pods? If the application.id is different across pods, each of them will process all of the messages once. If it's the same, then the messages will be split between the pods.
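A minimal sketch of pinning it, assuming the Streams configuration is built in code (the id and bootstrap servers values are illustrative):
// Every pod must use the same application.id so that they join one consumer
// group and split the input partitions between them.
Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "dedup-service");   // identical on every pod
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");   // illustrative
KafkaStreams streams = new KafkaStreams(builder.build(), props);
streams.start();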
Upvotes: 1