Kumar P

Reputation: 300

PySpark and MySQL Options

I want to write to a MySQL DB using PySpark as below:

df.write.format("jdbc").option("url", jdbc_url).option("driver", "com.mysql.jdbc.Driver")\
          .option("dbtable", table).option("user", user).option("password", password)\
          .mode('append').save()

I am not sure how this write works fully, i.e. is it atomic in nature? Sometimes, even when there is a failure, I see some records inserted, and at other times an error leaves none of the records inserted. What is the default behaviour of this write?

Also, is there any option that lets me write while ignoring DML errors such as column length issues or date format issues?

Upvotes: 1

Views: 321

Answers (1)

Saurabh

Reputation: 943

  1. Spark executes the insert queries in parallel, in independent transactions, which is why you sometimes find data after an error and sometimes don't. The insertion order may also differ from run to run. Spark builds the insert tasks from the DataFrame's partitions, whose count comes from spark.sql.shuffle.partitions (default 200) or from an explicit df.repartition(n). So, based on that configuration and the available partitions, it executes the insert queries in parallel with an independent transaction per partition, and if any error is encountered it stops the execution (see the first sketch after this list).
  2. To answer your other question: the queries are executed by MySQL, so Spark has no control over them. So no, there is no way to ignore such DML errors. Spark executes the query just as any other external system would. It is the developer's responsibility to make sure the data has the expected length and format; you can always convert the data into the expected format before writing to MySQL (see the second sketch below).
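
A minimal sketch of the first point, using placeholder connection values named like the variables in the question and a placeholder Parquet input path: each DataFrame partition is written by its own task in its own JDBC transaction, so cutting the partition count down reduces how many independent transactions can commit before a failure stops the job.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jdbc-write-demo").getOrCreate()

# Placeholder connection details, mirroring the variables in the question.
jdbc_url = "jdbc:mysql://localhost:3306/testdb"
table = "target_table"
user = "app_user"
password = "secret"

df = spark.read.parquet("/path/to/input")   # placeholder source data

# Each partition becomes a separate insert task with its own transaction;
# fewer partitions means fewer independent transactions (and less parallelism).
(df.repartition(1)
   .write.format("jdbc")
   .option("url", jdbc_url)
   .option("driver", "com.mysql.jdbc.Driver")
   .option("dbtable", table)
   .option("user", user)
   .option("password", password)
   .option("numPartitions", 1)              # also caps the JDBC connections Spark opens
   .mode("append")
   .save())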
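
And a sketch of the second point, continuing from the snippet above and assuming hypothetical target columns name VARCHAR(50) and created_on DATE: truncating and parsing the data up front avoids the column-length and date-format errors MySQL would otherwise raise partway through the write.

from pyspark.sql import functions as F

# Truncate to the declared column length and parse the date before writing,
# so MySQL never sees values it would reject.
clean_df = (df
    .withColumn("name", F.substring(F.col("name"), 1, 50))
    .withColumn("created_on", F.to_date(F.col("created_on"), "yyyy-MM-dd"))
)

clean_df.write.format("jdbc") \
    .option("url", jdbc_url) \
    .option("driver", "com.mysql.jdbc.Driver") \
    .option("dbtable", table) \
    .option("user", user) \
    .option("password", password) \
    .mode("append") \
    .save()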

For more info, refer to this.

Upvotes: 1
