Reputation: 463
There is a lot of hype about how good and fast Spark is at processing large amounts of data. So we wanted to investigate Spark's query performance.
Our cluster: 4 worker nodes (r3.2xlarge instances).
Our input data is stored in S3, split across 12 gzip files.
We created a table using Spark SQL for the aforementioned input data set.
Then we cached the table. We found from the Spark UI that Spark did not load all of the data into memory; it kept some in memory and spilled the rest to disk. UPDATE: We also tested with Parquet files. In that case, all of the data was loaded into memory and we executed the same queries as below, but the performance was still not good enough.
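For context, this is roughly how the table was set up, as a minimal sketch assuming the Spark 1.x HiveContext API; the S3 location and the two-column schema are placeholders, not our real layout:

```scala
// Minimal sketch, assuming the Spark 1.x HiveContext API.
// The S3 location and the column list are placeholders, not the real schema.
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)  // sc: the existing SparkContext

// Register the gzipped S3 files as an external table (HiveQL DDL).
hiveContext.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS Fact_data (date_key INT, value DOUBLE)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  LOCATION 's3n://my-bucket/fact_data/'
""")

// Mark the table as cached. Caching is lazy, so the data is only
// materialized (and possibly spilled to disk) on the first full scan.
hiveContext.cacheTable("Fact_data")
hiveContext.sql("SELECT COUNT(*) FROM Fact_data").collect()
```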
Let's assume the table name is Fact_data. We executed the following query on that cached table:
select date_key, sum(value) from Fact_data where date_key between 201401 and 201412 group by date_key order by 1

The query takes 1268.93 sec to complete. This is huge compared to the execution time in Redshift (a dc1.large cluster), which takes only 9.23 sec. I also tested some other queries, e.g. count, join, etc. Spark gives me really poor performance for each of them.
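For completeness, the timing was taken along these lines (an ad hoc, single-run wall-clock measurement, not a benchmark harness; hiveContext refers to the context from the sketch above):

```scala
// Ad hoc wall-clock timing of the query against the cached table (single run).
val start = System.nanoTime()
val rows = hiveContext.sql("""
  SELECT date_key, SUM(value)
  FROM Fact_data
  WHERE date_key BETWEEN 201401 AND 201412
  GROUP BY date_key
  ORDER BY 1
""").collect()
println(s"${rows.length} rows in ${(System.nanoTime() - start) / 1e9} sec")
```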
Questions
Could you suggest anything that might improve the performance of the query? Maybe I am missing some optimization techniques. Any suggestions would be highly appreciated.
How can I compel Spark to load all the data into memory? Currently it keeps some data in memory and some on disk.
Is there any performance difference between using the DataFrame API and a SQL table? I think not, because under the hood they use the same Catalyst optimizer (the two equivalent forms I have in mind are sketched below).
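Here is what I mean, as a sketch of the two forms: the same aggregation once via SQL and once via the DataFrame API, which should both be planned by Catalyst into essentially the same physical plan:

```scala
import org.apache.spark.sql.functions.{col, sum}

// SQL form
val viaSql = hiveContext.sql(
  "SELECT date_key, SUM(value) FROM Fact_data " +
  "WHERE date_key BETWEEN 201401 AND 201412 GROUP BY date_key ORDER BY 1")

// Equivalent DataFrame form
val viaDf = hiveContext.table("Fact_data")
  .filter(col("date_key").between(201401, 201412))
  .groupBy(col("date_key"))
  .agg(sum(col("value")))
  .orderBy(col("date_key"))

// Both are planned by the same Catalyst optimizer; compare the plans.
viaSql.explain(true)
viaDf.explain(true)
```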
Upvotes: 2
Views: 1227
Reputation: 21
I suggest you use Parquet as your file format instead of gzipped files: gzip-compressed files are not splittable, so each file can only be read by a single task, while Parquet is a splittable, columnar format that lets Spark read only the columns a query needs.
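For example, a one-off conversion could look roughly like this (a sketch; the output path is a placeholder and I'm assuming the data is already registered as the Fact_data table):

```scala
// One-off conversion: read the existing table once and rewrite it as Parquet.
// The output path is a placeholder.
hiveContext.table("Fact_data")
  .write
  .mode("overwrite")
  .parquet("s3n://my-bucket/fact_data_parquet/")

// Query the Parquet copy instead of the gzipped text files from now on.
hiveContext.read.parquet("s3n://my-bucket/fact_data_parquet/")
  .registerTempTable("Fact_data_parquet")
```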
You can also try increasing --num-executors, --executor-memory, and --executor-cores.
If you're using YARN and your instance type is r3.2xlarge, make sure the container size yarn.nodemanager.resource.memory-mb is larger than your --executor-memory (maybe around 55 GB). You also need to set yarn.nodemanager.resource.cpu-vcores to 15.
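If you prefer to set these in code rather than on the spark-submit command line, the equivalent configuration properties can go on the SparkConf (a sketch; the values are only illustrative and should be tuned to your nodes):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Equivalents of --num-executors / --executor-memory / --executor-cores on YARN.
// The values are illustrative only; tune them to the r3.2xlarge nodes.
val conf = new SparkConf()
  .setAppName("fact-data-queries")
  .set("spark.executor.instances", "8")
  .set("spark.executor.memory", "12g")
  .set("spark.executor.cores", "4")

val sc = new SparkContext(conf)
```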
Upvotes: 1