RB.

Reputation: 37192

Prevent cascading failures in Kafka consumers

Imagine you have a Kafka consumer group with 3 members (M1, M2, and M3). Each member runs in its own process, and each currently has one partition assigned (P1, P2, and P3).

M1 receives a poison message from P1, crafted so that it triggers a stack overflow exception, killing M1. This eventually triggers a rebalance, and M2 now has P1.

M2 will now receive the same poison message from P1 - and also die, triggering a rebalance and giving P1 to M3.

Finally, M3 will receive the same message and die.

At this point you have taken out your entire set of processors - and any new ones you spin up will also die until you have fixed the message in Kafka directly.

My question is: how does one prevent this cascading failure? I'm happy for the affected partition to be ignored until the issue is resolved, and I can see how I would use the pause functionality to achieve this in the case of a handled exception. However, I can't handle a stack overflow, so I'm not able to pause the partition easily.
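For what it's worth, the skip-and-continue behaviour I'd like can be sketched without a broker. This is a toy simulation (the process and recurse methods are stand-ins for a real handler), and it leans on a JVM-specific detail: a StackOverflowError raised by our own processing code can actually be caught once the stack unwinds.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PoisonSkipDemo {
    // Stand-in for the real record handler; "poison" triggers unbounded recursion.
    static void process(String payload) {
        if ("poison".equals(payload)) recurse(0);
    }

    static void recurse(int depth) {
        recurse(depth + 1);
    }

    public static String run() {
        // A fake partition: three records at offsets 0, 1, 2.
        List<String> partition = Arrays.asList("ok-1", "poison", "ok-2");
        List<String> processed = new ArrayList<>();
        List<Long> skipped = new ArrayList<>();
        for (long offset = 0; offset < partition.size(); offset++) {
            String record = partition.get((int) offset);
            try {
                process(record);
                processed.add(record);
            } catch (StackOverflowError e) {
                // JVM-specific: once the stack unwinds, the error is catchable,
                // so we can note the offset and move on instead of dying.
                skipped.add(offset);
            }
        }
        return "processed=" + processed + " skipped=" + skipped;
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

Whether this is safe in production is another matter (the overflow might happen inside library code, or leave state corrupted), but it shows the shape of the skip logic I'm after.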

Does Kafka have any mechanisms for handling this type of cascading failure?

Upvotes: 1

Views: 266

Answers (2)

Ashish Bhosle

Reputation: 657

One of the best questions on Apache Kafka.

Well, we can use the assign(Collection&lt;TopicPartition&gt; partitions) method to avoid such scenarios. In this particular case we can do the following:

M1

    // Manually assign partition 0 of "topic". With assign() there is no
    // consumer-group membership, so a member crash does not trigger a rebalance.
    Consumer<K, V> m1 = getConsumer();
    TopicPartition tp = new TopicPartition("topic", 0);
    m1.assign(Arrays.asList(tp));

M2

    Consumer<K, V> m2 = getConsumer();
    TopicPartition tp = new TopicPartition("topic", 1);
    m2.assign(Arrays.asList(tp));

M3

    Consumer<K, V> m3 = getConsumer();
    TopicPartition tp = new TopicPartition("topic", 2);
    m3.assign(Arrays.asList(tp));

NOTE: The above code is just an example
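The reason this helps: with assign() the consumers are not a coordinated group, so there is no rebalance to hand the poison partition to the next member. A broker-free toy simulation of that difference (the consumer names and partition numbers are made up):

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

public class StaticAssignmentDemo {
    public static String run() {
        // partition number -> the consumer that statically assign()ed it
        Map<Integer, String> owner = new LinkedHashMap<>();
        owner.put(0, "m1");
        owner.put(1, "m2");
        owner.put(2, "m3");

        int poisonPartition = 0;
        Set<String> alive = new LinkedHashSet<>(owner.values());

        // The owner of the poison partition crashes. Because there is no
        // group membership, its partition is simply left unconsumed; it is
        // NOT reassigned, so the other consumers keep running.
        alive.remove(owner.get(poisonPartition));
        return "alive=" + alive + " unconsumed=[" + poisonPartition + "]";
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

The trade-off is that the dead partition stays unconsumed until you restart its owner, and you lose automatic failover for ordinary crashes.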

You can find a detailed explanation here

If you need any further help let me know. Happy to help.

Upvotes: 2

JohnsonCore

Reputation: 581

Not to be snarky, but I'd surmise the best way to prevent a stack overflow from breaking processes would be to prevent a stack overflow. Anything else would essentially be a band-aid.

It is virtually guaranteed that any exception, including a stack overflow, that is encountered on one consumer will eventually be encountered by every other instance of that consumer, given enough messages.

With that in mind, and given that there are limited software means of handling stack-overflow exceptions, the only path I could recommend in good conscience is to pre-empt these kinds of exceptions before they happen.

If there are circumstances that keep you from pre-empting these exceptions, then more information might help us give more detailed answers.
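One concrete way to pre-empt the overflow, assuming the recursion lives in your own code: add an explicit depth guard so a hostile payload fails with an ordinary, catchable exception long before the stack is exhausted. The payload format and limit below are made up for illustration.

```java
public class DepthGuardDemo {
    static final int MAX_DEPTH = 1000;

    // Hypothetical recursive parser: counts leading '(' nesting in a payload.
    // The explicit depth counter turns a would-be StackOverflowError into a
    // normal exception that the consumer's error handling can deal with.
    static int countNesting(String payload, int i, int depth) {
        if (depth > MAX_DEPTH)
            throw new IllegalStateException("nesting too deep");
        if (i >= payload.length() || payload.charAt(i) != '(') return 0;
        return 1 + countNesting(payload, i + 1, depth + 1);
    }

    public static String run() {
        // A poison payload: far more nesting than any legitimate message.
        StringBuilder poison = new StringBuilder();
        for (int i = 0; i < 100_000; i++) poison.append('(');
        try {
            countNesting(poison.toString(), 0, 0);
            return "ok";
        } catch (IllegalStateException e) {
            return "rejected: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

The same idea applies to any resource the message can exhaust: validate size, nesting, and recursion depth up front, then route rejected records to a dead-letter topic instead of crashing.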

Upvotes: 0
