rarova

Reputation: 43

How to improve the performance of spark.write for JDBC?

I'm struggling with one thing. I have a 700 MB CSV which contains over 6 million rows. After filtering, it contains ~3 million.

I need to write it straight to Azure SQL via JDBC. It's super slow and takes 20 minutes to insert ~3 million rows.

My cluster has 14 GB RAM and 4 cores. Here is my code:

(clearedDF.repartition(4)
  .write
  .format("jdbc")
  .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
  .option("batchsize", 10000)
  .option("url", jdbcUrl)
  .option("dbtable", "dbo.weather")
  .option("user", properties["user"])
  .option("password", properties["password"])
  .mode("append")
  .save()
)

Is there any way to speed this process up?

Upvotes: 3

Views: 3154

Answers (1)

AjayKumarGhose

Reputation: 4943

Thank you, Alex Ott. Posting your suggestion as an answer to help other community members.

"You can control the parallelism by calling coalesce(<N>) or repartition(<N>) depending on the existing number of partitions. Call coalesce when reducing the number of partitions, and repartition when increasing the number of partitions."

import org.apache.spark.sql.SaveMode

// jdbcUrl and connectionProperties (a java.util.Properties holding user,
// password, and driver) are assumed to be defined, as in the question.
val df = spark.table("diamonds")

// Each partition is written over its own JDBC connection, so the
// partition count determines the write parallelism.
println(df.rdd.partitions.length)

// Given the number of partitions above, reduce it by calling coalesce()
// or increase it by calling repartition() to manage the number of connections.
df.repartition(10).write.mode(SaveMode.Append).jdbc(jdbcUrl, "diamonds", connectionProperties)

For more information, please refer to the MS documentation on Azure SQL using JDBC.
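Applied to the PySpark code from the question, the same approach looks roughly like the sketch below. This is a minimal illustration, assuming clearedDF, jdbcUrl, and properties are defined as in the question; the partition count of 8 is only an example value to tune.

# Minimal PySpark sketch of the same advice; clearedDF, jdbcUrl, and
# properties are assumed to be defined as in the question.

# Each partition is written over its own JDBC connection, so the
# partition count determines the write parallelism.
print(clearedDF.rdd.getNumPartitions())

# Increase the number of partitions with repartition() (or reduce it
# with coalesce()) before writing; 8 is an illustrative value only.
(clearedDF.repartition(8)
  .write
  .format("jdbc")
  .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
  .option("batchsize", 10000)
  .option("url", jdbcUrl)
  .option("dbtable", "dbo.weather")
  .option("user", properties["user"])
  .option("password", properties["password"])
  .mode("append")
  .save()
)

Keep in mind that with 4 cores only 4 write tasks run concurrently, so raising the partition count far beyond the core count mainly changes task granularity rather than the number of simultaneous connections.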

Upvotes: 1
