Mohan

Reputation: 897

PySpark operations are slower than Hive

I have three DataFrames: df1, df2 and df3. Each DataFrame has approximately 3 million rows. df1 and df3 have approximately 8 columns each; df2 has only 3 columns.
(The source text file for df1 is approximately 600 MB in size.)

These are the operations performed:

This entire process takes 15 minutes in PySpark 1.3.1, with the default partition value of 10, executor memory of 30 GB and driver memory of 10 GB, and I have used cache() wherever necessary.
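For reference, a minimal sketch of how a configuration like this might be wired up in PySpark 1.3.x; the app name and table name below are placeholders, not taken from the question:

    # Minimal sketch of the setup described above (PySpark 1.3.x style;
    # names are placeholders, not from the original question).
    from pyspark import SparkConf, SparkContext
    from pyspark.sql import HiveContext

    conf = (SparkConf()
            .setAppName("df-vs-hive")
            .set("spark.executor.memory", "30g")       # executor memory = 30G
            .set("spark.default.parallelism", "10"))   # default partition value = 10
    # Note: driver memory (10G) is normally passed at launch time,
    # e.g. spark-submit --driver-memory 10g, rather than via SparkConf.

    sc = SparkContext(conf=conf)
    sqlContext = HiveContext(sc)

    # A DataFrame that is reused several times is a typical place for cache():
    df1 = sqlContext.table("df1_table").cache()   # placeholder Hive table name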

But when I use Hive queries, the same work takes hardly 5 minutes. Is there a particular reason why my DataFrame operations are slow, and is there any way I can improve the performance?

Upvotes: 0

Views: 590

Answers (1)

Zichu Lee

Reputation: 107

You should be careful with the use of JOIN.

A JOIN in Spark can be really expensive, especially when it is between two large DataFrames, because both sides may have to be shuffled across the cluster. You can avoid much of that cost by repartitioning the two DataFrames on the join column beforehand, or by using the same partitioner for both.
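For illustration, a minimal sketch of that idea, assuming a newer PySpark (2.x+, where repartition accepts a column argument); the join column "id", the file paths and the partition count are placeholders:

    # Minimal sketch: repartition both DataFrames on the join key before joining,
    # so rows with the same key land in the same partitions and the join itself
    # avoids an extra shuffle. Assumes PySpark 2.x+; "id", the paths and the
    # partition count are placeholders.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("join-repartition-sketch").getOrCreate()

    df1 = spark.read.csv("/data/df1.txt", header=True, inferSchema=True)
    df2 = spark.read.csv("/data/df2.txt", header=True, inferSchema=True)

    # Repartition both sides on the same join column with the same partition count.
    df1_p = df1.repartition(200, "id")
    df2_p = df2.repartition(200, "id")

    joined = df1_p.join(df2_p, on="id", how="inner")
    joined.count()   # force execution to materialize the join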

Upvotes: 0
