Reputation: 463
There is a lot of hype about how good and fast Spark is at processing large amounts of data. So we wanted to investigate Spark's query performance.
Our cluster: 4 worker nodes (r3.2xlarge instances).
Our input data is stored in S3, split across 12 gzip files.
We created a table using Spark SQL for the aforementioned input data set.
Then we cached the table. We found from the Spark UI that Spark did not load all of the data into memory; it kept some in memory and spilled the rest to disk. UPDATE: We also tested with Parquet files. In that case, all of the data was loaded into memory and we executed the same queries as below, but the performance was still not good enough.
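For context, this is roughly how the table was set up, as a minimal sketch assuming the Spark 1.x HiveContext API; the S3 location and the two-column schema are placeholders, not our real layout:

```scala
// Minimal sketch, assuming the Spark 1.x HiveContext API.
// The S3 location and the column list are placeholders, not the real schema.
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)  // sc: the existing SparkContext

// Register the gzipped S3 files as an external table (HiveQL DDL).
hiveContext.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS Fact_data (date_key INT, value DOUBLE)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  LOCATION 's3n://my-bucket/fact_data/'
""")

// Mark the table as cached. Caching is lazy, so the data is only
// materialized (and possibly spilled to disk) on the first full scan.
hiveContext.cacheTable("Fact_data")
hiveContext.sql("SELECT COUNT(*) FROM Fact_data").collect()
```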
Let's assume the table name is Fact_data. We executed the following query on that cached table:
select date_key, sum(value) from Fact_data where date_key between 201401 and 201412 group by date_key order by 1

The query takes 1268.93 sec to complete. This is huge compared to the execution time in Redshift (a dc1.large cluster), which takes only 9.23 sec. I also tested some other queries, e.g. count, join, etc. Spark gives me really poor performance for each of them.
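For completeness, the timing was taken along these lines (an ad hoc, single-run wall-clock measurement, not a benchmark harness; hiveContext refers to the context from the sketch above):

```scala
// Ad hoc wall-clock timing of the query against the cached table (single run).
val start = System.nanoTime()
val rows = hiveContext.sql("""
  SELECT date_key, SUM(value)
  FROM Fact_data
  WHERE date_key BETWEEN 201401 AND 201412
  GROUP BY date_key
  ORDER BY 1
""").collect()
println(s"${rows.length} rows in ${(System.nanoTime() - start) / 1e9} sec")
```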
Questions
Could you suggest anything that might improve the performance of the query? Maybe I am missing some optimization techniques. Any suggestions would be highly appreciated.
How can I compel Spark to load all the data into memory? Currently it keeps some data in memory and some on disk.
Is there any performance difference between using the DataFrame API and a SQL table? I think not, because under the hood they use the same Catalyst optimizer (the two equivalent forms I have in mind are sketched below).
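Here is what I mean, as a sketch of the two forms: the same aggregation once via SQL and once via the DataFrame API, which should both be planned by Catalyst into essentially the same physical plan:

```scala
import org.apache.spark.sql.functions.{col, sum}

// SQL form
val viaSql = hiveContext.sql(
  "SELECT date_key, SUM(value) FROM Fact_data " +
  "WHERE date_key BETWEEN 201401 AND 201412 GROUP BY date_key ORDER BY 1")

// Equivalent DataFrame form
val viaDf = hiveContext.table("Fact_data")
  .filter(col("date_key").between(201401, 201412))
  .groupBy(col("date_key"))
  .agg(sum(col("value")))
  .orderBy(col("date_key"))

// Both are planned by the same Catalyst optimizer; compare the plans.
viaSql.explain(true)
viaDf.explain(true)
```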
Upvotes: 2
Views: 1227
Reputation: 21
I suggest you use Parquet as your file format instead of gzipped files: gzip-compressed files are not splittable, so each file can only be read by a single task, while Parquet is a splittable, columnar format that lets Spark read only the columns a query needs.
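For example, a one-off conversion could look roughly like this (a sketch; the output path is a placeholder and I'm assuming the data is already registered as the Fact_data table):

```scala
// One-off conversion: read the existing table once and rewrite it as Parquet.
// The output path is a placeholder.
hiveContext.table("Fact_data")
  .write
  .mode("overwrite")
  .parquet("s3n://my-bucket/fact_data_parquet/")

// Query the Parquet copy instead of the gzipped text files from now on.
hiveContext.read.parquet("s3n://my-bucket/fact_data_parquet/")
  .registerTempTable("Fact_data_parquet")
```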
You can also try increasing --num-executors, --executor-memory, and --executor-cores.
If you're using YARN and your instance type is r3.2xlarge, make sure the container size yarn.nodemanager.resource.memory-mb is larger than your --executor-memory (maybe around 55 GB). You also need to set yarn.nodemanager.resource.cpu-vcores to 15.
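If you prefer to set these in code rather than on the spark-submit command line, the equivalent configuration properties can go on the SparkConf (a sketch; the values are only illustrative and should be tuned to your nodes):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Equivalents of --num-executors / --executor-memory / --executor-cores on YARN.
// The values are illustrative only; tune them to the r3.2xlarge nodes.
val conf = new SparkConf()
  .setAppName("fact-data-queries")
  .set("spark.executor.instances", "8")
  .set("spark.executor.memory", "12g")
  .set("spark.executor.cores", "4")

val sc = new SparkContext(conf)
```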
Upvotes: 1