Reputation: 487
I am currently working on an Azure Databricks notebook that reads files from a storage container into a DataFrame and then writes all the records to a table in MySQL. The files can have anywhere from 2 million to 10 million rows. I have the following write code in my notebook after using read to populate my DataFrame.
newFile.repartition(16).write
  .format("jdbc").option("driver", "com.mysql.cj.jdbc.Driver")
  .mode("append")
  .option("url", s"${jdbc_url}")
  .option("dbtable", pathToDBTable)
  .option("user", s"user")
  .option("password", s"pass")
  .save()
I have played around with the partitions and decided to go with 16 because my cluster has 16 cores. Other than this, is there a faster way to insert all this data into my DB using write? Or are there any other suggested approaches to try within Azure Databricks notebooks? It currently takes 10-15 minutes for 2 million row files.
Upvotes: 1
Views: 508
Reputation: 87259
Most probably the delay is caused by the MySQL side - you're writing with 16 cores, and each one opens a separate connection to MySQL, so you can overload the database. Performance can also suffer if there are indexes on the columns being written, etc.
So it's recommended to check the MySQL side for reported problems, look at the load on the database node, check how many cores the instance has, etc.
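
If the database does turn out to be the bottleneck, it may also be worth tuning the JDBC write itself: fewer concurrent connections, a larger insert batch size, and MySQL Connector/J's rewriteBatchedStatements flag (which collapses a batch into multi-row INSERTs). A minimal sketch, reusing your newFile, jdbc_url, and pathToDBTable; the partition count of 8 and batch size of 10000 are only illustrative values to tune against your instance, and the URL parameter assumes jdbc_url has no query string yet:

// Fewer partitions means fewer concurrent MySQL connections.
// batchsize sets how many rows go into each JDBC insert batch (Spark's default is 1000).
// rewriteBatchedStatements=true lets the MySQL driver rewrite a batch as multi-row INSERTs.
newFile.repartition(8).write
  .format("jdbc")
  .option("driver", "com.mysql.cj.jdbc.Driver")
  .mode("append")
  .option("url", s"${jdbc_url}?rewriteBatchedStatements=true")
  .option("dbtable", pathToDBTable)
  .option("user", "user")
  .option("password", "pass")
  .option("batchsize", 10000)
  .save()

Measure the effect of each change separately - the right combination depends on how many connections and how large a write batch your MySQL instance can actually handle.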
Upvotes: 0