Reputation: 7784
Below is the high level usecase which im trying to workon.
we have stream of students data published into a Kafka topic and our module has to read the student ids as stream and fetch associated data from multiple sources for each student and perform some calculation for each student and publish the associated calculation for each student into a kafka topic.
So here the question is it better to write a single big Spark job or use Akka to have separate service for each source so that actors can work parallely take bunch of student ids and get the data from respective source and perform some bunch Transformations and actions and finally a calculation associated with each student .
Or do i really need to use Akka here? Will Spark handles this efficiently internally?
Appreciate any thoughts here.
Upvotes: 1
Views: 683
Reputation: 4334
If your transformations take data from Kafka as input and produce output back into Kafka, it appears the most natural fit is Kafka Streams. I'd look to that first. Kafka Streams take advantage of the partitioning of data on Kafka to process partition groups in parallel to each other, but process messages sequentially within in each group, similarly how akka actors work in parallel to each other but each actor internally processes messages sequentially.
However, if your calculation requires e.g. machine learning or in general some iterative data-processing which does re-partitioning (shuffling in spark lingo) of the data between iterations, then Kafka Streams would no longer be that good a fit, I think. Then I'd consider Spark or Flink.
Akka is really powerful and you can use it in both these cases and more. However, it's a lower level library than Kafka Streams, Spark or Flink. Which means you have more power but also more considerations to think about. If using akka, I'd go for akka-streams. They have a good integration with kafka via the akka-stream-kafka (aka reactive-kafka) library.
Upvotes: 1