Nick

Reputation: 177

Pandas to_sql() slow on one DataFrame but fast on others

Goal

I'm trying to use pandas DataFrame.to_sql() to send a large DataFrame (>1M rows) to an MS SQL server database.

Problem

The command is significantly slower on one particular DataFrame, taking about 130 sec to send 10,000 rows. In contrast, a similar DataFrame takes just 7 sec to send the same number of rows. The latter DataFrame actually has more columns, and more data as measured by df.memory_usage(deep=True).

Details

The SQLAlchemy engine is created via

engine = create_engine('mssql+pyodbc://@<server>/<db>?driver=ODBC+Driver+17+for+SQL+Server', fast_executemany=True)

The to_sql() call is as follows: df[i:i+chunksize].to_sql(table, conn, index=False, if_exists='replace') where chunksize = 10000.
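Timed in isolation, a single chunk looks roughly like this (a sketch; df and table are defined elsewhere, and the engine is passed directly here):

import time

chunksize = 10000

# Time one 10,000-row chunk: ~130 sec on the problematic DataFrame,
# ~7 sec on the comparable one.
start = time.perf_counter()
df[0:chunksize].to_sql(table, engine, index=False, if_exists='replace')
print(f'{time.perf_counter() - start:.1f} s for {chunksize} rows')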

I've attempted to locate the bottleneck via cProfile, but this only revealed that nearly all of the time is spent in pyodbc.Cursor.executemany.
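A minimal sketch of the profiling setup (df, table, and engine as above; requires Python 3.8+ for the context-manager form):

import cProfile
import pstats

# Profile a single 10,000-row insert and show the ten slowest calls
# by cumulative time.
with cProfile.Profile() as profiler:
    df[0:10000].to_sql(table, engine, index=False, if_exists='replace')

pstats.Stats(profiler).sort_stats('cumulative').print_stats(10)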

Any tips for debugging would be appreciated!

Upvotes: 2

Views: 3646

Answers (1)

Nick

Reputation: 177

Cause

The performance difference is due to an issue in pyodbc: when fast_executemany=True is enabled, passing None values as parameters to SQL Server INSERT statements causes a significant slowdown.
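One quick way to check whether a DataFrame is affected is to count the missing values per column before calling to_sql():

# Columns with a non-zero count contain NaN/None, which pandas sends to
# pyodbc as None and which trigger the slow path under fast_executemany.
print(df.isna().sum())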

Solution

We can pack the values as JSON and use OPENJSON (supported on SQL Server 2016+) instead of fast_executemany. This resulted in a 30x performance improvement in my application! Here's a self-contained example based on the documentation, adapted for pandas users.

import pandas as pd
from sqlalchemy import create_engine

df = pd.DataFrame({'First Name': ['Homer', 'Ned'], 'Last Name': ['Simpson', 'Flanders']})

# Serialize the rows as a single JSON string, e.g.
# '[{"First Name":"Homer","Last Name":"Simpson"}, ...]'
rows_as_json = df.to_json(orient='records')

server = '<server name>'
db = '<database name>'
table = '<table name>'
engine = create_engine(
    f'mssql+pyodbc://<user>:<password>@{server}/{db}'
    '?driver=ODBC+Driver+17+for+SQL+Server'
)

# OPENJSON parses the JSON parameter server-side and exposes it as a rowset,
# so the whole batch arrives in a single INSERT ... SELECT statement.
sql = f'''
INSERT INTO {table} ([First Name], [Last Name])
SELECT [First Name], [Last Name] FROM
    OPENJSON(?)
    WITH (
        [First Name] nvarchar(50) '$."First Name"',
        [Last Name] nvarchar(50) '$."Last Name"'
    )
'''

# Use the raw DBAPI connection so the JSON string is passed straight through
# to pyodbc as a single parameter.
conn = engine.raw_connection()
try:
    cursor = conn.cursor()
    cursor.execute(sql, rows_as_json)
    conn.commit()
finally:
    conn.close()
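Since the original DataFrame has over 1M rows, you'll probably want to send the JSON in batches rather than as one giant string; for example, replace the single cursor.execute() call above with a loop (the batch size is arbitrary):

batchsize = 100000  # rows per INSERT; adjust for your data

for i in range(0, len(df), batchsize):
    cursor.execute(sql, df[i:i + batchsize].to_json(orient='records'))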

Alternative Workarounds

  • Export data to CSV and use an external tool to complete the transfer (for example, the bcp utility).
  • Artificially replace the values that would be converted to None with a non-empty filler value, add a helper column to indicate which rows were changed, complete the insert as normal via to_sql(), then reset the filler values to NULL with a separate query based on the helper column (see the sketch after this list).
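A rough sketch of that second workaround, using the example columns from above (the filler value and helper column name are made up):

FILLER = '__MISSING__'  # placeholder that must never occur in the real data

# Helper column marking which rows had a missing last name.
df['last_name_was_null'] = df['Last Name'].isna()

# Replace the missing values so fast_executemany never receives None,
# then load the data as usual.
df['Last Name'] = df['Last Name'].fillna(FILLER)
df.to_sql(table, engine, index=False, if_exists='replace')

# Finally, reset the filler back to NULL server-side using the helper column.
with engine.begin() as conn:
    conn.exec_driver_sql(
        f"UPDATE {table} SET [Last Name] = NULL WHERE last_name_was_null = 1"
    )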

Acknowledgement

Many thanks to @GordThompson for pointing me to the solution!

Upvotes: 3
