aladeen

Reputation: 307

How to optimize Spark SQL operations on a large data frame?

I have a large Hive table (~9 billion records, ~45 GB in ORC format). I am using Spark SQL to do some profiling of the table, but any operation on it takes too much time. Just a count on the input data frame takes ~11 minutes to complete, and min, max and avg on any single column takes more than an hour and a half.
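
For context, the profiling itself is nothing fancy, just plain aggregate queries like the ones below (my_table and col1 are placeholder names, not the real ones):

-- placeholder table/column names, just to show the shape of the queries
SELECT COUNT(*) FROM my_table
SELECT MIN(col1), MAX(col1), AVG(col1) FROM my_table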

I am working on a limited-resource cluster (as it is the only one available): a total of 9 executors, each with 2 cores and 5 GB of memory, spread over 3 physical nodes.

Is there any way to optimise this, say, bring the time for running all the aggregate functions on each column down to under 30 minutes with the same cluster? Or is bumping up my resources the only way? I am personally not very keen on doing that. One solution I came across to speed up data frame operations is to cache them, but I don't think it is a feasible option in my case.
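
(In Spark SQL terms, the caching option I mean would be something like the statement below, but with ~45 GB of data and only 9 × 5 GB of total executor memory it does not seem feasible here.)

-- Spark SQL's built-in caching; my_table is a placeholder name
CACHE TABLE my_table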

All the real-world scenarios I have come across use huge clusters for this kind of load.

Any help is appreciated. I use Spark 1.6.0 in standalone mode with the Kryo serializer.

Upvotes: 1

Views: 4393

Answers (1)

arglee

Reputation: 1404

There are some cool features in Spark SQL, like:

Cluster By / Distribute By / Sort By

Spark allows you to write queries in a SQL-like language: HiveQL. HiveQL lets you control the partitioning of data, and we can use the same clauses in Spark SQL queries as well.

Distribute By

In Spark, DISTRIBUTE BY repartitions the DataFrame by some expression; all the rows for which this expression is equal end up on the same partition.

SET spark.sql.shuffle.partitions = 2
SELECT * FROM df DISTRIBUTE BY key

So, let's look at how it works. Suppose df starts out with these partitions:

par1: [(1,c), (3,b)]
par2: [(3,c), (1,b), (3,d)]
par3: [(3,a), (2,a)]

With 2 shuffle partitions, this will transform into:

par1: [(1,c), (3,b), (3,c), (1,b), (3,d), (3,a)]
par2: [(2,a)]
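
As a rough sketch of how this could feed an aggregation (value is a placeholder column name, and whether the pre-distribution actually speeds things up depends on your data and on how the planner handles the subquery), you can distribute by the expression you later group on:

SELECT key, COUNT(*), MIN(value), MAX(value), AVG(value)
FROM (SELECT * FROM df DISTRIBUTE BY key) t
GROUP BY key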

Sort By

SELECT * FROM df SORT BY key

For this case, the partitions will look like this (rows are ordered within each partition, not globally):

par1: [(1,c), (1,b), (3,b), (3,c), (3,d), (3,a)]
par2: [(2,a)]
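
For contrast (not part of the original example): SORT BY only orders rows within each partition; a single global ordering would need ORDER BY instead, which is more expensive because it has to produce one total order across the whole data set.

SELECT * FROM df ORDER BY key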

Cluster By

This is a shortcut for using DISTRIBUTE BY and SORT BY together on the same set of expressions.

SET spark.sql.shuffle.partitions = 2
SELECT * FROM df CLUSTER BY key
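
In other words, the query above behaves the same as spelling out both clauses explicitly:

SELECT * FROM df DISTRIBUTE BY key SORT BY key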

Note: This is just basic information. Let me know if this helps; otherwise, we can use various other methods to optimize your Spark jobs and queries, depending on the situation and settings.

Upvotes: 1
