Kleyson Rios
Kleyson Rios

Reputation: 2877

Understanding some concepts and Hazelcast Jet integrated with Kafka

I'm trying to map some concepts between Spark Structured Streaming and Hazelcast Jet, and understand another subjects as well.

Q1 - In the Spark, each Kafka partition will become a partition inside spark, then they will be processed by individual tasks in parallel. I think that I've read somewhere that Hazelcast Jet will merge all the messages from kafka regardless the group.id and topic partitions, is that correct ?

Q2 - How do we grow the number of "consumers" in a Jet program to increase the throughput consuming from kafka ? In Spark I guess we only need to grow the number of topic partitions in order to a new spark task be assigned for the new partition.

Q3 - If the Q1 above is true, is it possible avoid that merge and distribute the kafka partitions to be processed in parallel ? Once the messages will be already grouped and ordered in a kafka partition, having all the messages merged imply extra processing to re-partition and sort the messages again.

Q4 - How is defined the number of each vertex ? I mean, in the word count example we have the tokenizer and accumulator, how Jet will decide/divide the number of processors to creates instances of tokenizer's and accumulator's ?

Upvotes: 2

Views: 614

Answers (1)

Oliv
Oliv

Reputation: 10822

A1 - The number of parallel processors is totally independent from the number of Kafka partitions. The number of processors is determined by local parallelism of the vertex and the number of members:

totalParallelism = numberOfMembers * localParallelism

Each processor will be assigned a subset of all topic-partitions and uses one KafkaConsumer. group.id is not used, Jet uses manual partition assignment.

A2 - Adding new partitions to Kafka topic doesn't increase the number of consumers. You need to increase the local parallelism.

A3 - There's no extra cost of "merging" and "sorting". You might want to have a look here. Basically each vertex is backed by multiple parallel processors and each edge is backed by multiple queues, one queue for each two processors. If downstream processor takes items from multiple queues, it just does so queue by queue; there's no extra cost for merging. There's also no sorting in the sense that items are reordered. If the edge is not distributed, all processing is local and nothing is serialized.

Answer is valid for Jet 0.5.1 and 0.6 (which is in development at the time of writing).

A4 - see A1.

Upvotes: 1

Related Questions