chabin

Reputation: 21

DataFrame write to Azure-SQL row-by-row performance

We are using Azure Databricks Spark to write data to an Azure SQL database. Last week we switched from runtime 9.1 (Spark 3.1) to the newer 14.3 (Spark 3.5), using the Spark-native JDBC driver. However, when we write data it appears that Spark JDBC now creates an individual "insert into" statement for each row, which results in large DB overhead (especially for large tables), and the DB audit log grows enormously. For example, when we insert 10k rows / 3 cols, it creates 10k insert statements, which turns out to be approx. 8 MB of audit log file on blob storage.

Though the audit logs are not an operational issue, it makes no sense for Spark to do row-by-row inserts, since there must be large overhead for sending each query. Is there a way for the Spark JDBC write to do something like a bulk insert? Why is there a batchsize option on write if it has no effect?
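To illustrate the difference in principle (this is a plain DB-API sketch using sqlite3, not Spark or the JDBC driver): batching submits many parameter sets in one call instead of issuing one statement per row, which is what JDBC's addBatch/executeBatch is meant to do. Note that even with client-side batching, the server may still record each INSERT individually unless the driver uses a true bulk-load path.

```python
import sqlite3

# Hypothetical local illustration of row-by-row vs. batched inserts.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE testbulkwrite (id INTEGER, id2 TEXT, id3 TEXT)")

rows = [(i, "ab", f"row{i}") for i in range(10_000)]

# Row-by-row: 10k separate statements, one round trip each.
# for r in rows:
#     conn.execute("INSERT INTO testbulkwrite VALUES (?, ?, ?)", r)

# Batched: one call submits all parameter sets.
conn.executemany("INSERT INTO testbulkwrite VALUES (?, ?, ?)", rows)
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM testbulkwrite").fetchone()[0]
print(count)  # 10000
```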

Details:

create table sandbox.testbulkwrite (
    id int,
    id2 varchar(2),
    id3 varchar(10)
)

DataFrame size and content

if SPARK_DB_FORMAT in ("jdbc", "com.microsoft.sqlserver.jdbc.SQLServerDriver"):
    (df_final
        .write
        .format(SPARK_DB_FORMAT)
        .option("url", connString)  # JDBC connection string
        .mode("append")
        .option("dbtable", "sandbox.testbulkwrite")
        .option("encrypt", "true")
        .option("batchsize", 100000)
        .save())
if SPARK_DB_FORMAT == "sqlserver":
    (df_final
        .write
        .format(SPARK_DB_FORMAT)
        .mode("append")
        .option("host", "llll.database.windows.net")
        .option("port", "1433")  # optional, defaults to 1433 if omitted
        .option("user", "YYY")
        .option("password", "ZZZK")
        .option("database", "YYY")
        .option("dbtable", "sandbox.testbulkwrite")
        .option("encrypt", "true")
        .option("batchsize", 100000)
        .save())

Once we insert the data, I check how many times the statement was executed in the SQL database via the sys.dm_exec_query_stats DMV, which shows the parameterized "INSERT INTO" statement executed once for each DataFrame row. (screenshot: execution count of insert)
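For reference, a query along these lines can surface the execution count of the parameterized statement (sys.dm_exec_sql_text resolves the statement text from the plan handle; the LIKE filter on the table name is just one way to narrow it down):

```sql
-- Count how often the parameterized INSERT was executed
SELECT st.text, qs.execution_count
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
WHERE st.text LIKE '%INSERT INTO%testbulkwrite%'
ORDER BY qs.execution_count DESC;
```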

Spark DAG of insert:

When I tested the https://github.com/microsoft/sql-spark-connector driver with Spark 3.1.2, there was no issue: this driver appears not to rely on individual insert statements, as there is no trace of them in the sys DB logs, so it must use a different write mechanism.

Upvotes: 1

Views: 167

Answers (1)

four mofo

Reputation: 1

After more digging (e.g. https://community.databricks.com/t5/data-engineering/databricks-jdbc-odbc-write-batch-size/td-p/10059 and https://techcommunity.microsoft.com/t5/azure-sql-blog/turbo-boost-data-loads-from-spark-using-sql-spark-connector/ba-p/305523) it appears there is currently no JDBC driver for writing to Azure SQL that supports bulk insert on Spark 3.5.

For now, we are testing the older https://github.com/microsoft/sql-spark-connector. Although its latest compatible version only targets Spark 3.4, so far it appears to perform bulk inserts correctly even on Spark 3.5 (Databricks runtime 14.3). It would be appreciated if Microsoft actually kept this library alive, as there are multiple requests for this.

SPARK_DB_FORMAT: str = "com.microsoft.sqlserver.jdbc.spark"
SPARK_DB_DRIVER: str = "com.microsoft.sqlserver.jdbc.SQLServerDriver"
(spark_df
    .write
    .format(SPARK_DB_FORMAT)
    .option("driver", SPARK_DB_DRIVER)
    .option("url", CORE_JDBC_CONN_STR)
    .mode("append")
    .option("dbtable", "sandbox.testbulkwrite")
    .option("encrypt", "true")
    .option("batchsize", "10000")
    .option("tableLock", "true")
    .option("schemaCheckEnabled", "true")
    .save()
)

You also need to manually install the JAR from the release page: spark-mssql-connector_2.12-1.4.0-BETA.jar.

Upvotes: 0
