Reputation: 43
I'm struggling with one thing. I have a 700 MB CSV which contains over 6 million rows. After filtering it contains ~3 million.
I need to write it straight to Azure SQL via JDBC. It's super slow and takes 20 minutes to insert the 3 million rows.
My cluster has 14 GB RAM and 4 cores. Here is my code.
(clearedDF.repartition(4)
.write
.format("jdbc")
.option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
.option("batchsize", 10000)
.option("url", jdbcUrl)
.option("dbtable", "dbo.weather")
.option("user", properties["user"])
.option("password", properties["password"])
.mode("append")
.save()
)
Is there any way to speed this process up?
Upvotes: 3
Views: 3154
Reputation: 4943
Thank you, Alex Ott. Posting your suggestion as an answer to help other community members.
"You can control the parallelism by calling coalesce(<N>)
or repartition(<N>)
depending on the existing number of partitions. Call coalesce
when reducing the number of partitions, and repartition
when increasing the number of partitions."
import org.apache.spark.sql.SaveMode
val df = spark.table("diamonds")
println(df.rdd.partitions.length)
// Given the number of partitions above, you can reduce the partition value by calling coalesce() or increase it by calling repartition() to manage the number of connections.
df.repartition(10).write.mode(SaveMode.Append).jdbc(jdbcUrl, "diamonds", connectionProperties)
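Applied to the DataFrame in the question, a minimal PySpark sketch of the same idea could look like this (assuming the clearedDF, jdbcUrl, and properties from the question; the partition count and batch size are example values to tune against what your Azure SQL tier can absorb):

```python
# Sketch only: reuses clearedDF, jdbcUrl, and properties from the question.
# Each partition writes over its own JDBC connection, so more partitions
# means more concurrent inserts; a larger batchsize reduces round trips.
(clearedDF
    .repartition(8)  # e.g. 2x the number of cores; tune to the target DB's limits
    .write
    .format("jdbc")
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
    .option("url", jdbcUrl)
    .option("dbtable", "dbo.weather")
    .option("user", properties["user"])
    .option("password", properties["password"])
    .option("batchsize", 100000)  # larger batches cut per-row overhead
    .mode("append")
    .save())
```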
For more information, please refer to the MS documentation on Azure SQL using JDBC.
Upvotes: 1