chabin

Reputation: 21

DataFrame write to Azure-SQL row-by-row performance

We are using Azure Databricks Spark to write data to an Azure SQL database. Last week we switched from runtime 9.1 (Spark 3.1) to the newer 14.3 (Spark 3.5), using the Spark-native JDBC driver. However, when we write data it appears that Spark JDBC now creates an individual "insert into" statement for each row, which results in large DB overhead (especially for large tables), and the DB audit log grows enormously. For example, when we insert 10k rows / 3 cols, it creates 10k insert statements, which turns out to be approx. 8 MB of audit log file on blob storage.

Though the audit logs are not an operational issue, it makes no sense for Spark to do row-by-row inserts, since there must be large overhead for sending each query. Is there a way for the Spark JDBC write to do something like a bulk insert? Why is there a batchsize option on write if it has no effect?
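To illustrate the difference in principle (this is a plain DB-API sketch using sqlite3, not Spark or the JDBC driver): batching submits many parameter sets in one call instead of issuing one statement per row, which is what JDBC's addBatch/executeBatch is meant to do. Note that even with client-side batching, the server may still record each INSERT individually unless the driver uses a true bulk-load path.

```python
import sqlite3

# Hypothetical local illustration of row-by-row vs. batched inserts.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE testbulkwrite (id INTEGER, id2 TEXT, id3 TEXT)")

rows = [(i, "ab", f"row{i}") for i in range(10_000)]

# Row-by-row: 10k separate statements, one round trip each.
# for r in rows:
#     conn.execute("INSERT INTO testbulkwrite VALUES (?, ?, ?)", r)

# Batched: one call submits all parameter sets.
conn.executemany("INSERT INTO testbulkwrite VALUES (?, ?, ?)", rows)
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM testbulkwrite").fetchone()[0]
print(count)  # 10000
```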

Details:

create table sandbox.testbulkwrite (
    id int,
    id2 varchar(2),
    id3 varchar(10)
)

DataFrame size and content

if SPARK_DB_FORMAT in ("jdbc", "com.microsoft.sqlserver.jdbc.SQLServerDriver"):
    (df_final
        .write
        .format(SPARK_DB_FORMAT)
        .option("url", connString)  # JDBC connection string
        .mode("append")
        .option("dbtable", "sandbox.testbulkwrite")
        .option("encrypt", "true")
        .option("batchsize", 100000)
        .save())
if SPARK_DB_FORMAT == "sqlserver":
    (df_final
        .write
        .format(SPARK_DB_FORMAT)
        .mode("append")
        .option("host", "llll.database.windows.net")
        .option("port", "1433")  # optional, defaults to 1433 if omitted
        .option("user", "YYY")
        .option("password", "ZZZK")
        .option("database", "YYY")
        .option("dbtable", "sandbox.testbulkwrite")
        .option("encrypt", "true")
        .option("batchsize", 100000)
        .save())

Once we insert the data, I check how many times the statement was executed in the SQL database via the sys.dm_exec_query_stats DMV, which shows the parameterized "INSERT INTO" statement executed once for each DataFrame row. (screenshot: execution count of insert)
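For reference, a query along these lines can surface the execution count of the parameterized statement (sys.dm_exec_sql_text resolves the statement text from the plan handle; the LIKE filter on the table name is just one way to narrow it down):

```sql
-- Count how often the parameterized INSERT was executed
SELECT st.text, qs.execution_count
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
WHERE st.text LIKE '%INSERT INTO%testbulkwrite%'
ORDER BY qs.execution_count DESC;
```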

Spark DAG of insert:

When I tested the https://github.com/microsoft/sql-spark-connector driver with Spark 3.1.2, there was no issue: this driver appears not to rely on individual insert statements, as there is no trace of them in the sys DB logs, so it must use a different write mechanism.

Upvotes: 1

Views: 167

Answers (1)

four mofo

Reputation: 1

After more digging (e.g. https://community.databricks.com/t5/data-engineering/databricks-jdbc-odbc-write-batch-size/td-p/10059 and https://techcommunity.microsoft.com/t5/azure-sql-blog/turbo-boost-data-loads-from-spark-using-sql-spark-connector/ba-p/305523) it appears there is currently no JDBC driver for writing to Azure SQL that supports bulk insert on Spark 3.5.

For now, we are testing the older https://github.com/microsoft/sql-spark-connector. Although its latest compatible version only targets Spark 3.4, so far it appears to perform bulk inserts correctly even on Spark 3.5 (Databricks runtime 14.3). It would be appreciated if Microsoft actually kept this library alive, as there are multiple requests for this.

SPARK_DB_FORMAT: str = "com.microsoft.sqlserver.jdbc.spark"
SPARK_DB_DRIVER: str = "com.microsoft.sqlserver.jdbc.SQLServerDriver"
(spark_df
    .write
    .format(SPARK_DB_FORMAT)
    .option("driver", SPARK_DB_DRIVER)
    .option("url", CORE_JDBC_CONN_STR)
    .mode("append")
    .option("dbtable", "sandbox.testbulkwrite")
    .option("encrypt", "true")
    .option("batchsize", "10000")
    .option("tableLock", "true")
    .option("schemaCheckEnabled", "true")
    .save()
)

You also need to manually install the JAR from the release page: spark-mssql-connector_2.12-1.4.0-BETA.jar.

Upvotes: 0
