Spring Cloud Stream Kafka Streams: The number of downstream messages doesn't match the sum of messages sent to the topic

Question

I have a Spring Boot based Spring Cloud Stream Kafka Streams Binder application. It defines a topology with the following piece in it:

The numbers in green show the number of messages that passed through the topology defined by the respective processors, which are bound via the Spring Cloud Stream Kafka Streams binder. Here are the respective properties:

spring.cloud.stream.bindings:
  ...
  hint1Stream-out-0:
    destination: hints
  realityStream-out-0:
    destination: hints
  countStream-in-0:
    destination: hints

I am counting the messages that each processor produces / consumes using peek() methods as follows:

return stream -> {
    stream
        .peek((k, v)-> input0count.incrementAndGet())
        ...
        .peek((k, v)-> output0count.incrementAndGet())
};

I am starting my application from a unit test using Embedded Kafka with pretty much default settings:

@RunWith(SpringRunner.class)
@SpringBootTest(
    properties = "spring.cloud.stream.kafka.binder.brokers=${spring.embedded.kafka.brokers}"
)
@EmbeddedKafka(partitions = 1,
        topics = {
                ...
                TOPIC_HINTS
        }
)
public class MyApplicationTests {
...

In my test I am waiting sufficiently long until all published test messages reach the countStream:

CountDownLatch latch = new CountDownLatch(1);
...
publishFromCsv(...)
...
latch.await(30, TimeUnit.SECONDS);
logCounters();
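
The snippet above does not show the latch ever being counted down, in which case await() simply acts as a 30-second sleep. A minimal sketch, assuming the expected total is known up front, of a counter that releases the latch as soon as that total is reached. The class name CountingLatch and its methods are hypothetical and not part of the original test:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical helper: counts records and opens a latch at an expected total.
public class CountingLatch {
    private final AtomicLong seen = new AtomicLong();
    private final long expected;
    private final CountDownLatch done = new CountDownLatch(1);

    public CountingLatch(long expected) {
        this.expected = expected;
    }

    // Intended to be called from peek(): counts one record and
    // opens the latch once the expected total has been reached.
    public long record() {
        long n = seen.incrementAndGet();
        if (n >= expected) {
            done.countDown();
        }
        return n;
    }

    // Blocks until the expected total is reached or the timeout expires.
    public boolean await(long timeout, TimeUnit unit) throws InterruptedException {
        return done.await(timeout, unit);
    }

    public boolean isComplete() {
        return done.getCount() == 0;
    }

    public long seen() {
        return seen.get();
    }

    public static void main(String[] args) throws InterruptedException {
        CountingLatch latch = new CountingLatch(3);
        for (int i = 0; i < 3; i++) {
            latch.record();
        }
        boolean complete = latch.await(1, TimeUnit.SECONDS);
        System.out.println("complete=" + complete + ", seen=" + latch.seen());
    }
}
```

With this, a false return value from await() distinguishes "messages are still missing after the timeout" from "everything arrived", which a plain timed wait cannot do.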

As you can see, the sum of the messages put into the "hints" topic doesn't match the count of messages on the "countStream" side: 1309 + 2589 = 3898 != 3786

Am I perhaps missing some Kafka or Kafka Streams setting to flush every batch? Maybe my custom TimestampExtractor generates timestamps that are "too old"? (I'm pretty sure they are not less than zero.) Maybe this has something to do with Kafka log compaction?

What could be the reason for this mismatch?

Update

Checked the underlying topic offsets by executing

kafka-run-class kafka.tools.GetOffsetShell --broker-list localhost:60231 --topic hints

while the test was waiting for timeout.

The number of messages in the topic is equal to the sum of the two input stream counts, as expected. The number of messages that arrived at the countStream input is still a couple of dozen less than expected.
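
For reference, GetOffsetShell prints one line per partition in the form topic:partition:offset. By default it reports the latest offset; passing --time -2 reports the earliest one instead. A minimal sketch of turning two such outputs into the number of records still held in the log. The class OffsetDelta, its methods, and the sample numbers are hypothetical:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for comparing GetOffsetShell outputs.
public class OffsetDelta {

    // Sums the offset field of lines shaped like "topic:partition:offset".
    public static long sumOffsets(List<String> lines) {
        long total = 0;
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // skip blank lines in the tool output
            }
            int idx = trimmed.lastIndexOf(':');
            total += Long.parseLong(trimmed.substring(idx + 1));
        }
        return total;
    }

    // Records still in the log = latest offsets (--time -1) minus earliest offsets (--time -2).
    public static long retained(List<String> earliest, List<String> latest) {
        return sumOffsets(latest) - sumOffsets(earliest);
    }

    public static void main(String[] args) {
        // Illustrative values only: a single partition with nothing deleted yet.
        List<String> earliest = Arrays.asList("hints:0:0");
        List<String> latest = Arrays.asList("hints:0:3898");
        System.out.println(retained(earliest, latest)); // prints 3898
    }
}
```

An earliest offset greater than zero means records have already been removed from the start of the log.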

Other Kafka configuration in use:

spring.cloud.stream.kafka.streams:
    configuration:
      schema.registry.url: mock://torpedo-stream-registry
      default.key.serde: org.apache.kafka.common.serialization.Serdes$StringSerde
      default.value.serde: io.confluent.kafka.streams.serdes.avro.SpecificAvroSerde
      commit.interval.ms: 100

That corresponds to processing.guarantee = at_least_once. I could not test processing.guarantee = exactly_once, since that requires a cluster of at least 3 brokers.

Setting all of these:

spring.cloud.stream.kafka.binder.configuration:
  auto.offset.reset: earliest
spring.cloud.stream.kafka.streams.binder.configuration:
  auto.offset.reset: earliest
spring.cloud.stream.kafka.streams:
  default:
    consumer:
      startOffset: earliest
spring.cloud.stream.bindings:
  countStream-in-0:
    destination: hints
    consumer:
      startOffset: earliest
      concurrency: 1

didn't help :(

What helped was leaving only stream.peek(..) in the countStream consumer, like this:

@Bean
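// Generic parameters were lost in the post; the value type name Hint is an assumption.
public Consumer<KStream<String, Hint>> countStream() {
    return stream -> {
        KStream<String, Hint> kstream = stream.peek((k, v) -> input0count.incrementAndGet());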
    };
}

In this case I immediately start getting the expected number of messages counted on the Count Consumer side.

That means that the internals of my Count Consumer affect the behaviour.

Here is its full version which "doesn't work":

@Bean
// Generic parameters were lost in the post; the value type name Hint is an assumption.
public Consumer<KStream<String, Hint>> countStream() {
    return stream -> {
        KStream<String, Hint> kstream = stream.peek((k, v) -> notifyObservers(input0count.incrementAndGet()));

        KStream<String, Hint> realityStream = kstream
            .filter((key, hint) -> realityDetector.getName().equals(hint.getDetector()));

        KStream<String, Hint> hintsStream = kstream
            .filter((key, hint) -> !realityDetector.getName().equals(hint.getDetector()));

        this.countsTable = kstream
            .groupBy((key, hint) -> key.concat(":").concat(hint.getDetector()))
            .count(Materialized
                .as("countsTable"));

        this.countsByActionTable = kstream
            .groupBy((key, hint) -> key.concat(":")
                .concat(hint.getDetector()).concat("|")
                .concat(hint.getHint().toString()))
            .count(Materialized
                .as("countsByActionTable"));

        this.countsByHintRealityTable = hintsStream
            .join(realityStream,
                (hint, real) -> {
                    hint.setReal(real.getHint());
                    return hint;
                }, JoinWindows.of(countStreamProperties.getJoinWindowSize()))
            .groupBy((key, hint) -> key.concat(":")
                .concat(hint.getDetector()).concat("|")
                .concat(hint.getHint().toString()).concat("-")
                .concat(hint.getReal().toString())
            )
            .count(Materialized
                .as("countsByHintRealityTable"));

    };
}

I am storing counts in several KTables there. This is what is happening inside of the Counts Consumer:

Update 2

The last piece of the Count Consumer is apparently causing the initial unexpected behaviour:

this.countsByHintRealityTable = hintsStream
        .join(realityStream,
            (hint, real) -> {
                hint.setReal(real.getHint());
                return hint;
            }, JoinWindows.of(countStreamProperties.getJoinWindowSize()))
        .groupBy((key, hint) -> key.concat(":")
            .concat(hint.getDetector()).concat("|")
            .concat(hint.getHint().toString()).concat("-")
            .concat(hint.getReal().toString())
        )
        .count(Materialized
            .as("countsByHintRealityTable"));

Without it the message counts match as expected.

How can such downstream code affect the Consumer KStream input?

abinet · Accepted Answer

The messages can be deleted because of the retention policy. Changing the topology changes the amount of time needed for processing. If retention kicks in during processing, you can lose messages. It also depends on the offset reset policy.

Try setting log.retention.hours=-1. This disables retention for auto-created topics.
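
With the @EmbeddedKafka setup from the question, a broker setting like this can be passed through the annotation's brokerProperties attribute. A sketch, assuming spring-kafka-test is on the test classpath and using the literal topic name "hints" in place of the question's TOPIC_HINTS constant (the other topics are elided in the question and omitted here):

```java
import org.junit.runner.RunWith;
import org.springframework.boot.test.context.SpringBootTest;
import org.springframework.kafka.test.context.EmbeddedKafka;
import org.springframework.test.context.junit4.SpringRunner;

@RunWith(SpringRunner.class)
@SpringBootTest(
    properties = "spring.cloud.stream.kafka.binder.brokers=${spring.embedded.kafka.brokers}"
)
@EmbeddedKafka(
    partitions = 1,
    // Assumption: literal name standing in for TOPIC_HINTS and the elided topics.
    topics = { "hints" },
    // Broker-level setting: do not delete records based on age.
    brokerProperties = { "log.retention.hours=-1" }
)
public class MyApplicationTests {
    // test methods unchanged
}
```

If the counts match with this setting in place, that points to retention as the cause of the missing messages.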
