Sajila S

Reputation: 41

How to improve query performance in Spark?

I have a query which joins 4 tables, and I used query pushdown to read it into a DataFrame.

val df = spark.read.format("jdbc")
  .option("url", "jdbc:mysql://ip/dbname")
  .option("driver", "com.mysql.jdbc.Driver")
  .option("user", "username")
  .option("password", "password")
  .option("dbtable", s"($query) as temptable")
  .load()

The individual tables contain 430, 350, 64, and 2354 records respectively. It takes 12.784 sec to load the DataFrame and 2.119 sec to create the SparkSession.

Then I count the result data:

 val count = df.count()
 println(s"count $count")

The total execution time is then 25.806 sec, and the result contains only 430 records.

When I run the same query in SQL Workbench it takes only a few seconds to execute completely. I also tried cache() after load(), but it takes the same time. How can I make this execute much faster?
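The cache attempt was roughly the following (a sketch, since the exact code isn't shown above):

// assumed placement: cache() called on the loaded DataFrame before counting
val cachedDf = df.cache()
println(s"count ${cachedDf.count()}")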

Upvotes: 1

Views: 431

Answers (2)

ankit

Reputation: 1529

Try options like:

- partitionColumn
- numPartitions
- lowerBound
- upperBound

These options can help improve query performance: they create multiple partitions, so the read happens in parallel.
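A minimal sketch of a partitioned read, assuming the joined result has a numeric column (here called id) whose value range is known; the column name, bounds, and partition count are placeholders:

// Sketch only: "id", the bounds, and the partition count are assumptions.
// partitionColumn must be a numeric, date, or timestamp column in the result.
val df = spark.read.format("jdbc")
  .option("url", "jdbc:mysql://ip/dbname")
  .option("driver", "com.mysql.jdbc.Driver")
  .option("user", "username")
  .option("password", "password")
  .option("dbtable", s"($query) as temptable")
  .option("partitionColumn", "id")  // column used to split the read
  .option("lowerBound", "1")        // smallest value of partitionColumn
  .option("upperBound", "2354")     // largest value of partitionColumn
  .option("numPartitions", "4")     // number of parallel JDBC reads
  .load()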

Upvotes: 0

Arnon Rotem-Gal-Oz

Reputation: 25929

You are using a tool meant to handle big data on toy examples, so you are getting all of the overhead and none of the benefits.

Upvotes: 5
