Reputation: 71
We have very simple Spark Streaming job (implemented in Java), which is:
The JSONs are generated with set of 1000 IDs (keys) and all the events are randomly distributed to Kafka topic partitions. This also means, that resulting set of objects is max 1000, as the job is storing only the object with highest quality for each ID.
We were running the performance tests on AWS EMR (m4.xlarge = 4 cores, 16 GB memory) with following parameters:
Kafka cluster contains just 1 broker, which is utilized to max ~30-40% during the peak load (we're pre-filling the data to topic and then independently executing the test). We have tried to increase the num.io.threads and num.network.threads, but without significant improvement.
The he results of performance tests (about 10 minutes of continuous load) were (YARN master and Driver nodes are on top of the node counts bellow):
The CPU utilization in case of 2 nodes was ~
We also played around other settings including: - testing low/high number of partitions - testing low/high/default value of defaultParallelism - testing with higher number of executors (i.e. divide the resources to e.g. 30 executors instead of 10) but the settings above were giving us the best results.
So - the question - is Kafka + Spark (almost) linearly scalable? If it should be scalable much better, than our tests shown - how it can be improved. Our goal is to support hundreds/thousands of Spark executors (i.e. scalability is crucial for us).
Upvotes: 3
Views: 1297
Reputation: 71
We have resolved this by:
At the end, we've been able to achieve 1 100 000 events/s processed by the spark cluster with 10 executor nodes. The tuning made also increased the performance on configurations with less nodes -> we have achieved practically linear scalability when scaling from 2 to 10 spark executor nodes (m4.xlarge on AWS).
At the beginning, CPU on Kafka node wasn't approaching to limits, however it was not able to respond to the demands of Spark executors.
Thnx for all suggestions, particulary for @ArturBiesiadowski, who suggested the Kafka cluster incorrect sizing.
Upvotes: 4