Reputation: 46878
Assume I have two Kafka topics, A and B. I am trying to develop a system that pulls records from A, applies a transformation to each record, then publishes the transformed records to B. In this case, the transformation involves calling a REST endpoint over HTTP.
Being relatively new to Kafka, I was glad to see that the Kafka Streams project already solved this type of problem (consume-transform-publish). Unfortunately, I discovered that transformations in Kafka streams are blocking operations. Instinctively, I try to call HTTP endpoints in a non-blocking, asynchronous manner.
Does this mean that Kafka Streams will not work in this situation? Does this mean that I must revert back to calling the REST endpoint in a blocking manner? Is this even an acceptable pattern for Kafka Streams? Stream-based data processing is still relatively new to me, so I am not entirely familiar with its concurrency models.
Upvotes: 7
Views: 4333
Reputation: 38113
Update: after looking in to this further, I am not sure that this is the right answer...
I am new to Kafka and Kafka Streams (hereafter referred to as "Kafka"), but having encountered and considered similar questions, here is my perspective:
Kafka has two salient features:
Many really nice properties fall out from these features. For example, stream-based "transactions", I think, is one of the coolest.
But whether these properties are actually "features" in the sense that you want them, of course, depends on the application. If you don't want strongly ordered processing with parallelism based on topic partitioning, then you might not want to be using Kafka for that application.
So, with regard to:
Does this mean that Kafka Streams will not work in this situation?
It will work, but increased parallelism is achieved through increased partitioning.
Does this mean that I must revert back to calling the REST endpoint in a blocking manner?
Yes, I think it does—but I'm not sure why that would be a "reversion". Personally, that's what I like about Kafka: blocking code is simpler. If I want more parallelism, I can run more threads. There's no shared state, after all.
Upvotes: 4