Luniam

Reputation: 463

Spark Poor Query performance: How to improve query performance on Spark?

There is a lot of hype about how good and fast Spark is at processing large amounts of data.

So we wanted to investigate the query performance of Spark.

Our cluster has 4 worker nodes (r3.2xlarge instances).

Our input data is stored in S3, split across 12 gzip files.

We created a table using Spark SQL for the aforementioned input data set.

Then we cached the table. We found from the Spark UI that Spark did not load all of the data into memory; rather, it loaded some of it into memory and spilled the rest to disk.

UPDATE: We also tested with Parquet files. In this case, all of the data was loaded into memory. We then executed the same queries as below. Performance was still not good enough.
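For reference, roughly this is how the table was registered and cached, assuming the Spark 1.x SQLContext API; the S3 path and input format below are placeholders, not the exact job we ran:

    // Rough sketch of the setup (Spark 1.x SQLContext API); the S3 path and
    // input format are placeholders.
    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)

    // Load the gzipped input from S3 (placeholder path/format).
    val df = sqlContext.read.json("s3n://our-bucket/fact_data/*.gz")

    // Register it as a table and cache it, as described above.
    df.registerTempTable("Fact_data")
    sqlContext.cacheTable("Fact_data")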

Let's assume the table name is Fact_data. We executed the following query on that cached table:

    select date_key, sum(value)
    from Fact_data
    where date_key between 201401 and 201412
    group by date_key
    order by 1

The query takes 1268.93 sec to complete. This is huge compared to the execution time on Redshift (dc1.large cluster), which takes only 9.23 sec. I also tested some other queries, e.g. count, join, etc. Spark gives me really poor performance on each of these queries.

Upvotes: 2

Views: 1227

Answers (1)

Joey Van Halen

Reputation: 21

  1. I suggest you use Parquet as your file format instead of gzipped files (see the first sketch after this list).

  2. You can try increasing your --num-executors, --executor-memory and --executor-cores (see the second sketch after this list).

  3. If you're using YARN and your instance type is r3.2xlarge, make sure your container size (yarn.nodemanager.resource.memory-mb) is larger than your --executor-memory (maybe around 55G). You also need to set yarn.nodemanager.resource.cpu-vcores to 15.
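For point 1, a minimal sketch of converting the gzipped input to Parquet, assuming the Spark 1.x SQLContext API; the S3 paths and the input format (JSON here) are placeholders you would replace with your own:

    // Read the existing gzipped files and rewrite them as Parquet.
    // Paths and input format are illustrative only.
    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    val raw = sqlContext.read.json("s3n://our-bucket/fact_data/*.gz")

    // Parquet is columnar and splittable, so scans and aggregations
    // touch far less data than gzipped row files.
    raw.write.parquet("s3n://our-bucket/fact_data_parquet/")

For points 2 and 3, the same executor sizing can also be expressed in code via SparkConf instead of spark-submit flags; the numbers below are illustrative for an r3.2xlarge worker (8 vCPUs, 61 GB RAM), not tuned values:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("FactDataQueries")
      .set("spark.executor.instances", "8")  // equivalent to --num-executors
      .set("spark.executor.memory", "20g")   // equivalent to --executor-memory
      .set("spark.executor.cores", "5")      // equivalent to --executor-cores

    // On the YARN side, yarn.nodemanager.resource.memory-mb (e.g. ~55G) must be
    // larger than executor memory plus overhead, and
    // yarn.nodemanager.resource.cpu-vcores raised (e.g. to 15) in yarn-site.xml.
    val sc = new SparkContext(conf)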

Upvotes: 1
