Pausing Stream Consumption

Question

I am working on an application that processes very few records, roughly 2 requests per minute. These requests are creates and updates for a set of data. The requirements are guaranteed, reliable delivery, an ordering guarantee, and no loss of messages.

1. Our team has decided to use Kafka, and I think it does not fit the use case, since Kafka is best suited to streaming data. We might have been better off with a traditional messaging model. Although Kafka provides ordering per partition, the same can be achieved on a traditional messaging system if the number of messages is low and there are only a few data sources. Would that be a fair statement?
2. We are using Kafka Streams for processing the data, and the processing requires lookups against external systems. If the external systems are not available, we stop processing, and we automatically resume delivering messages to the target systems once the external lookup systems are available again. At the moment, we stop processing by looping continuously in the middle of processing a record and checking whether the systems are available. a) Is that the best way to stop a stream midway through processing so that it doesn't pick up any more messages? b) Are data stream frameworks even designed to be stopped or paused midway so that they stop consuming the stream completely for some time?

miguno · Accepted Answer

Regarding your point 2:

a) Is that the best way to stop stream midway while processing so that it doesn't pick up any more messages ?

If, as in your case, you have a very low incoming data rate (a few records per minute), then it might be OK to pause processing of an input stream while required dependency systems are unavailable.

In Kafka Streams the preferable API to implement such a behavior -- which, as you are alluding to yourself, is not really a recommended pattern -- is the Processor API.

Even so there are a couple of important questions you need to answer yourself, such as:

What is the desired/required behavior of your stream processing application if the external systems are down for extended periods of time?
Could the incoming data rate increase at some point, which could mean that you would need to abandon the pausing approach above?

But again, if pausing is what you want or need to do, then you can give it a try.
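If you do go the pausing route, a minimal sketch of the waiting part could look like the following. It is plain Java rather than Kafka Streams code, and it is an assumption about how you might structure it, not a recommended implementation: `DependencyGate` and its `isAvailable` health check are hypothetical names. Compared to a tight loop, it sleeps between checks and doubles the delay up to a cap.

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

/** Blocks the calling thread until a dependency reports available, backing off between checks. */
final class DependencyGate {
    private final BooleanSupplier isAvailable; // hypothetical health check for the external system
    private final long initialDelayMs;
    private final long maxDelayMs;

    DependencyGate(BooleanSupplier isAvailable, Duration initialDelay, Duration maxDelay) {
        this.isAvailable = isAvailable;
        // Use at least 1 ms so that doubling the delay actually grows it.
        this.initialDelayMs = Math.max(1L, initialDelay.toMillis());
        this.maxDelayMs = Math.max(this.initialDelayMs, maxDelay.toMillis());
    }

    /** Waits until the check succeeds and returns how many checks failed before that. */
    int awaitAvailable() {
        long delayMs = initialDelayMs;
        int failedChecks = 0;
        while (!isAvailable.getAsBoolean()) {
            failedChecks++;
            try {
                Thread.sleep(delayMs);
            } catch (InterruptedException e) {
                // Restore the interrupt flag so the caller can shut down cleanly.
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted while waiting for dependency", e);
            }
            // Exponential backoff, capped at maxDelayMs.
            delayMs = Math.min(delayMs * 2, maxDelayMs);
        }
        return failedChecks;
    }
}
```

You would call `awaitAvailable()` from your processing code before doing the lookup. Keep in mind that this blocks the stream thread: if the wait can exceed the consumer's `max.poll.interval.ms`, the instance may be considered failed and its partitions reassigned, so you would need to size that setting for the longest outage you want to ride out.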

b) Are data stream frameworks even designed to be stopped or paused midway so they stop consuming the stream completely for some time ?

Some stream processing tools allow you to do that. Whether it's the best pattern to use them is a different question.

For instance, you could also consider the following alternative: you could automatically ingest the external systems' data into Kafka, too, for example via Kafka's built-in Kafka Connect framework. Then, in Kafka Streams, you could read this exported data into a KTable (think of this KTable as a continuously updated cache of the latest data from your external system), and then perform a stream-table join between your original, low-rate input stream and this KTable.

Such stream-table joins are a common (and recommended) pattern to enrich an incoming data stream with side data (disclaimer: I wrote this article); for example, to enrich a stream of user click events with the latest user profile information. One of the advantages of this approach -- compared to your current setup of querying external systems combined with a pausing behavior -- is that your stream processing application would be decoupled from the availability (and scalability) of your external systems.
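To make the join semantics concrete, here is a small in-memory model in plain Java. It is only an illustration and does not use the Kafka Streams API; `LookupJoinModel` and its method names are made up for this sketch. The table keeps the latest value per key, a null value deletes the key, and each stream record is joined against whatever the table holds at the moment the record arrives.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** In-memory model of an inner stream-table join; not Kafka Streams code. */
final class LookupJoinModel {
    // Latest side data per key, like a continuously updated cache of the external system.
    private final Map<String, String> table = new HashMap<>();

    /** A table update replaces the previous value for the key; a null value removes the key. */
    void onTableUpdate(String key, String value) {
        if (value == null) {
            table.remove(key);
        } else {
            table.put(key, value);
        }
    }

    /** Joins a stream record with the current table entry; empty if the key has no entry. */
    Optional<String> onStreamRecord(String key, String value) {
        String sideData = table.get(key);
        if (sideData == null) {
            return Optional.empty(); // inner join: no match, no output
        }
        return Optional.of(value + " | " + sideData);
    }
}
```

In the Kafka Streams DSL, the corresponding operation is `KStream#join(KTable, ValueJoiner)`, or `leftJoin` if you still want an output record when the table has no entry for the key. Note that the lookup never leaves the application, which is why an outage of the external system no longer stalls processing.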
