Apache Flink and Kafka Stream Benchmarking

Question

What I'm trying to do

For a university project i am are trying to compare Apache Flink and Apache Kafka Streaming Performance (Throughput, Latency) using different configurations (1 Nodes, 2 Nodes, 4 Nodes, changing amount of CPU cores etc.).

For this purpose i have created a twitter JSON dataset containing ~ 15.000 Tweets, each tweet delimited by a newline.

The Problem

As far as i know Kafka follows the pattern "Producer - Kafka Cluster/Brokers - Consumer" and, for benchmarking the latency, i would measure the time between Producer and Consumer for each record.

The problem is, as far as i can tell Apache Flink lacks the ability of the Producer pattern - it seems like i have to specify a source for the data stream which the TaskManagers ("Consumers") would then fetch and process.

This makes it hard for me to tell how i should benchmark both systems in a comparable way because, for the latency measurement, i would measure the time between Producer and Consumer whereas in Flink i would have to measure the time between the JobManager and TaskManagers. So the producer part would be missing here.

Assuming i haven't misunderstood something, how would i measure both systems in a comparable way in order to make reasonable judgements?

Apache Flink and Kafka Stream Benchmarking

What I'm trying to do

The Problem

Answers (1)

Related Questions