Reputation: 177
I'm trying to use pandas DataFrame.to_sql() to send a large DataFrame (>1M rows) to an MS SQL Server database.
The command is significantly slower on one particular DataFrame, taking about 130 sec to send 10,000 rows. In contrast, a similar DataFrame takes just 7 sec to send the same number of rows. The latter DataFrame actually has more columns, and more data as measured by df.memory_usage(deep=True).
The SQLAlchemy engine is created via
engine = create_engine('mssql+pyodbc://@<server>/<db>?driver=ODBC+Driver+17+for+SQL+Server', fast_executemany=True)
The to_sql() call is as follows:
df[i:i+chunksize].to_sql(table, conn, index=False, if_exists='replace')
where chunksize = 10000.
I've attempted to locate the bottleneck via cProfile, but this only revealed that nearly all of the time is spent in pyodbc.Cursor.executemany.
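For reference, a minimal sketch of that profiling step (the connection string matches the one above; the DataFrame and table name are just placeholders, not my actual data):
import cProfile
import pstats
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('mssql+pyodbc://@<server>/<db>?driver=ODBC+Driver+17+for+SQL+Server',
                       fast_executemany=True)
# stand-in for the problematic DataFrame
df = pd.DataFrame({'a': [None, 'x'] * 5000, 'b': range(10000)})

profiler = cProfile.Profile()
profiler.enable()
df.to_sql('<table name>', engine, index=False, if_exists='replace')
profiler.disable()

# Show the 10 most expensive calls by cumulative time
pstats.Stats(profiler).sort_stats('cumulative').print_stats(10)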
Any tips for debugging would be appreciated!
Upvotes: 2
Views: 3646
Reputation: 177
The performance difference turned out to be due to an issue in pyodbc: passing None values to SQL Server INSERT statements with the fast_executemany=True option results in significant slowdowns.
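For illustration, a minimal sketch of the scenario that triggers it, i.e. an object column containing None values inserted with fast_executemany=True (all names and sizes here are made up):
import pandas as pd
from sqlalchemy import create_engine

# Object column with many None values, the case described above
df = pd.DataFrame({'name': ['Homer', None] * 50000, 'score': range(100000)})

engine = create_engine('mssql+pyodbc://@<server>/<db>?driver=ODBC+Driver+17+for+SQL+Server',
                       fast_executemany=True)

# This insert is disproportionately slow compared to the same frame without None values
df.to_sql('<table name>', engine, index=False, if_exists='replace')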
We can pack the values as JSON and use OPENJSON (supported on SQL Server 2016+) instead of fast_executemany. This solution resulted in a 30x performance improvement in my application! Here's a self-contained example, based on the documentation here, but adapted for pandas users.
import pandas as pd
from sqlalchemy import create_engine

df = pd.DataFrame({'First Name': ['Homer', 'Ned'], 'Last Name': ['Simpson', 'Flanders']})

# Serialize all rows as a single JSON array of records
rows_as_json = df.to_json(orient='records')

server = '<server name>'
db = '<database name>'
table = '<table name>'
engine = create_engine(f'mssql+pyodbc://<user>:<password>@{server}/{db}?driver=ODBC+Driver+17+for+SQL+Server')

# OPENJSON shreds the JSON parameter back into rows server-side,
# so the whole batch is sent as one parameter instead of row-by-row
sql = f'''
INSERT INTO {table} ([First Name], [Last Name])
SELECT [First Name], [Last Name]
FROM OPENJSON(?)
WITH (
    [First Name] nvarchar(50) '$."First Name"',
    [Last Name] nvarchar(50) '$."Last Name"'
)
'''

# Use the raw pyodbc connection so the JSON string is passed as a single parameter
conn = engine.raw_connection()
cursor = conn.cursor()
cursor.execute(sql, rows_as_json)
conn.commit()
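For a DataFrame as large as the one in the question, you may prefer to send the JSON in chunks rather than as one huge string. A rough sketch, reusing sql, engine and df from the example above and assuming the target table already exists:
conn = engine.raw_connection()
cursor = conn.cursor()

chunksize = 10000
for i in range(0, len(df), chunksize):
    # Each chunk is serialized to its own JSON array and sent as a single parameter
    cursor.execute(sql, df[i:i+chunksize].to_json(orient='records'))

conn.commit()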
An alternative workaround is to temporarily set the None values to a non-empty filler value, add a helper column to indicate which rows were changed, complete the operation as normal via to_sql(), then reset the filler values to NULL via a separate query based on the helper column.
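A rough sketch of that alternative, with a made-up column name and filler value (not the approach used above, shown only for completeness):
import pandas as pd
from sqlalchemy import create_engine, text

engine = create_engine('mssql+pyodbc://<user>:<password>@<server>/<db>?driver=ODBC+Driver+17+for+SQL+Server',
                       fast_executemany=True)

df = pd.DataFrame({'name': ['Homer', None, 'Ned'], 'score': [1, 2, 3]})

# Remember which rows were None, then replace None with a filler value
df['name_was_null'] = df['name'].isna()
df['name'] = df['name'].fillna('__FILLER__')

df.to_sql('<table name>', engine, index=False, if_exists='replace')

# Reset the filler values back to NULL and drop the helper column
with engine.begin() as conn:
    conn.execute(text("UPDATE [<table name>] SET [name] = NULL WHERE [name_was_null] = 1"))
    conn.execute(text("ALTER TABLE [<table name>] DROP COLUMN [name_was_null]"))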
Many thanks to @GordThompson for pointing me to the solution!
Upvotes: 3