Reputation: 41
I have a query that joins 4 tables, and I used query pushdown to read the result into a DataFrame:
val df = spark.read.format("jdbc")
  .option("url", "jdbc:mysql://ip/dbname")
  .option("driver", "com.mysql.jdbc.Driver")
  .option("user", "username")
  .option("password", "password")
  .option("dbtable", s"($query) as temptable")
  .load()
The individual tables contain 430, 350, 64, and 2354 records respectively. The load takes 12.784 sec, plus 2.119 sec to create the SparkSession.
Then I count the result:
val count = df.count()
println(s"count $count")
The total execution time is then 25.806 sec, and the result contains only 430 records.
When I run the same query in SQL Workbench, it completes in only a few seconds. I also tried cache() after load(), but it takes the same time. How can I make this execute faster?
Upvotes: 1
Views: 431
Reputation: 1529
Try options like:
partitionColumn
numPartitions
lowerBound
upperBound
These options can help improve query performance, as they create multiple partitions and the read happens in parallel.
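A minimal sketch reusing the options from the question (the partition column "id" is an assumption; it must be a numeric column in the query result, and all four options have to be set together):

val df = spark.read.format("jdbc")
  .option("url", "jdbc:mysql://ip/dbname")
  .option("driver", "com.mysql.jdbc.Driver")
  .option("user", "username")
  .option("password", "password")
  .option("dbtable", s"($query) as temptable")
  // "id" is a hypothetical numeric column in the result set
  .option("partitionColumn", "id")
  // number of parallel JDBC reads
  .option("numPartitions", "4")
  // lowerBound/upperBound only define the stride of each partition;
  // they do not filter rows out of the result
  .option("lowerBound", "1")
  .option("upperBound", "2354")
  .load()

Note that with only a few hundred rows in the result, partitioning is unlikely to help much; these options pay off on genuinely large reads.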
Upvotes: 0
Reputation: 25929
You are using a tool meant to handle big data to solve a toy example, so you are getting all of the overhead and none of the benefits.
Upvotes: 5