Reputation: 3270
I'm evaluating Spark/Cassandra and Elasticsearch to decide which one to go for.
Right now, I'm using Spark and Cassandra to generate different reports, but I noticed that with 2 million records (approx. 400 columns), the 5 reports take about 9.7, 9.8, 9.9, 10, and 10 minutes respectively to generate.
Changing the scheduling mode ("spark.scheduler.mode" set to "FAIR") doesn't seem to make a big difference.
I'm thinking of loading all the data into memory and caching it, so that subsequent queries run faster against the preloaded data.
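Roughly what I have in mind is the following sketch (the keyspace and table names here are just placeholders):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("reports")
  .config("spark.scheduler.mode", "FAIR") // the scheduler change mentioned above
  .getOrCreate()

// read the full table once from Cassandra via the spark-cassandra-connector
val data = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "my_keyspace", "table" -> "my_table"))
  .load()

data.cache()  // keep the table in memory so the 5 reports reuse the same scan
data.count()  // force the cache to materialize before the reports run
```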
However, running the same report in Elasticsearch takes just 2 minutes.
Any ideas on what I can do to improve Spark's response time?
Upvotes: 0
Views: 1493
Reputation: 41
Spark cannot push down the "group by" operation to Elasticsearch. I also found another bottleneck that affected transmission speed: set "es.scroll.size" to 10000. It can increase the transmission speed from ES to Spark.
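For example, something like this (a sketch using the elasticsearch-hadoop connector; the host and index name are placeholders):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("es-report")
  .config("es.nodes", "localhost")    // placeholder ES host
  .config("es.scroll.size", "10000")  // bigger scroll pages: fewer round trips from ES to Spark
  .getOrCreate()

// load the index as a DataFrame; the group-by then runs on the Spark side,
// since the connector can't push that aggregation down into ES
val df = spark.read
  .format("org.elasticsearch.spark.sql")
  .load("reports")                    // placeholder index name

df.groupBy("some_field").count().show()
```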
Upvotes: 1
Reputation: 16576
To start, I wouldn't consider Elasticsearch vs Spark to be a good comparison, since the two systems target different use cases: Elasticsearch focuses on search and quick retrieval, while Spark is a general-purpose analytics framework aimed at very large datasets.
But as to how to make your Spark reports run quicker: the key thing to do with C* is to make sure that spark.cassandra.input.split.size is small enough that you get enough Spark tasks to fully exploit the parallelism in your cluster. After doing this, you can consider caching the read table in memory for much faster access.
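A sketch of what that tuning might look like (the host, keyspace, table, and split value are illustrative; note that newer connector releases name this setting spark.cassandra.input.split.sizeInMB):

```scala
import org.apache.spark.sql.SparkSession

// Smaller splits -> more Spark partitions/tasks -> more parallelism on read.
val spark = SparkSession.builder()
  .appName("tuned-reports")
  .config("spark.cassandra.connection.host", "127.0.0.1") // placeholder host
  .config("spark.cassandra.input.split.size", "50000")    // illustrative value
  .getOrCreate()

val table = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "my_keyspace", "table" -> "my_table"))
  .load()

table.rdd.getNumPartitions // sanity check: did the smaller splits give you more tasks?

table.cache() // then cache the read table so repeated reports avoid rescanning C*
```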
Upvotes: 3