Reputation: 300
I want to write to DB using pyspark as below
df.write.format("jdbc").option("url", jdbc_url).option("driver", "com.mysql.jdbc.Driver")\
.option("dbtable", table).option("user", user).option("password", password)\
.mode('append').save()
I am not sure how it works fully, i.e. is it atomic in nature? Sometimes, even if there is a failure, I see some records inserted, and sometimes in case of errors none of the records is inserted. What is the default behaviour of this write?
Also, is there any option where I can write while ignoring DML errors like column length issues or date format issues etc.?
Upvotes: 1
Views: 321
Reputation: 943
Spark executes the insert queries in parallel in independent transactions, and this is why sometimes you find data after an error and sometimes you don't. It is also possible that the insertion order is not the same every time. Spark builds the insert query per partition, and the number of partitions comes from the value of spark.sql.shuffle.partitions (the default is 200) or from df.repartition(val) if you have used it. So, based on the configuration and the available partitions, Spark executes the insert queries in parallel, with an independent transaction for each partition. If any error is encountered, it stops the execution.
Such DML errors are raised by MySQL itself, so Spark doesn't have any control over them. So NO, there is no way to ignore such DML errors. Spark executes the query just as any other external system would. It is the developer's responsibility to make sure the data has the expected length and format. You can always convert the data to the expected format before writing to MySQL. For more info refer to this.
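Since Spark itself will not skip bad rows, a common approach is to coerce the DataFrame into the shape the table expects before calling write. A minimal sketch, where the column names ("name", "event_date") and the 50-character limit are hypothetical and should be adjusted to your table's schema:

from pyspark.sql import functions as F

clean_df = (df
    .withColumn("name", F.substring("name", 1, 50))                   # truncate to a VARCHAR(50) column
    .withColumn("event_date", F.to_date("event_date", "yyyy-MM-dd"))  # parse dates; unparseable values become null
    .filter(F.col("event_date").isNotNull()))                         # drop rows whose date could not be parsed

clean_df.write.format("jdbc") \
    .option("url", jdbc_url).option("driver", "com.mysql.jdbc.Driver") \
    .option("dbtable", table).option("user", user).option("password", password) \
    .mode("append").save()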
Upvotes: 1