user19498404

Reputation: 31

Pandas .to_sql() Output exceeds the size limit

I'm using pyodbc and SQLAlchemy to insert data into a table in SQL Server, and I'm getting this error:

https://i.sstatic.net/miSp9.png

Here are snippets of the functions I use.

This is the function I use to connect to SQL Server (using fast_executemany):

import pyodbc
from urllib.parse import quote_plus
from sqlalchemy import create_engine

def connect(server, database):
    global cnxn_str, cnxn, cur, quoted, engine
    cnxn_str = ("Driver={SQL Server Native Client 11.0};"
                "Server=<server>;"
                "Database=<database>;"
                "UID=<user>;"
                "PWD=<password>;")
    cnxn = pyodbc.connect(cnxn_str)
    cur = cnxn.cursor()
    cur.fast_executemany = True

    # Build a SQLAlchemy engine from the same ODBC connection string
    quoted = quote_plus(cnxn_str)
    engine = create_engine('mssql+pyodbc:///?odbc_connect={}'.format(quoted),
                           fast_executemany=True)

And this is the function I'm using to query and insert the data into SQL Server:

import pandas as pd

def insert_to_sql_server():
    global df, np_array

    # DataFrame df is built from a numpy array with dtype=object
    df = pd.DataFrame(np_array[1:,], columns=np_array[0])

    # Add new columns, data processing
    df['comp_key'] = df['col1']+"-"+df['col2'].astype(str)
    df['comp_key2'] = df['col3']+"-"+df['col4'].astype(str)+"-"+df['col5'].astype(str)
    df['comp_statusID'] = df['col6']+"-"+df['col7'].astype(str)

    convert_dict = {'col1': 'string', 'col2': 'string', ..., 'col_n': 'string'}

    # Convert data types of columns from object to string
    df = df.astype(convert_dict)

    connect(<server>, <database>)
    cur.rollback()

    # Delete old records
    cur.execute("DELETE FROM <table>")
    cur.commit()

    # Insert dataframe into the table
    df.to_sql(<table name>, engine, index=False,
              if_exists='replace', schema='dbo', chunksize=1000, method='multi')

The insert function runs for about 30 minutes before finally returning the error message.

I encountered no errors when doing this with a smaller df. The current df is 27963 rows and 9 columns. One thing I think contributes to the error is the length of the strings. By default the numpy array is dtype='<U25', but I had to override this to dtype='object' because it was truncating the text data.
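For illustration, here is a minimal sketch of the truncation I mean (the column name is made up):

import numpy as np
import pandas as pd

long_text = "x" * 40  # a 40-character string

# A fixed-width unicode array silently truncates to 25 characters
fixed = np.array([long_text], dtype='<U25')
print(len(fixed[0]))   # 25

# dtype=object keeps the full Python string
full = np.array([long_text], dtype=object)
print(len(full[0]))    # 40

df_demo = pd.DataFrame({'demo_col': full})
print(df_demo['demo_col'].str.len())   # 40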

I'm out of ideas because it seems like the error is referring to limitations of either Pandas or the SQL Server, which I'm not familiar with.

Thanks

Upvotes: 1

Views: 2750

Answers (3)

Shakeel

Reputation: 21

To show a progress bar while inserting data into a SQL Server table, you can use the tqdm library. Here's an example:

import pandas as pd
from sqlalchemy import create_engine
from tqdm import tqdm

# Define the SQL Server connection string
server = 'your_server_name'
database = 'your_database_name'
table_name = 'your_table_name'
connection_string = f'mssql+pyodbc://@{server}/{database}?trusted_connection=yes&driver=ODBC+Driver+17+for+SQL+Server'

# Create the SQLAlchemy engine
engine = create_engine(connection_string)

# Set the chunk size
chunk_size = 10000  # Adjust this value based on your requirements

# Get the total number of chunks
total_chunks = len(df) // chunk_size + (len(df) % chunk_size > 0)

# Create a progress bar
progress_bar = tqdm(total=total_chunks, desc='Inserting data', unit='chunk')

# Function to generate chunks of the DataFrame
def chunker(seq, size):
    return (seq[pos:pos + size] for pos in range(0, len(seq), size))

# Insert the DataFrame in chunks: replace the table with the first chunk,
# then append the remaining chunks
for i, chunk in enumerate(chunker(df, chunk_size)):
    if_exists_mode = 'replace' if i == 0 else 'append'

    # Insert the chunk into the SQL Server table
    chunk.to_sql(name=table_name, con=engine, if_exists=if_exists_mode, index=False)

    # Update the progress bar
    progress_bar.update()

# Close the progress bar
progress_bar.close()

# Print a message when the insertion is complete
print('Data insertion complete.')

Upvotes: 0

santiago-giordano

Reputation: 21

In my case, I had the same "Output exceeds the size limit" error, and I fixed it by adding method='multi' to df.to_sql(). I tried the chunksize solution first and it didn't work. So check this if you're in the same scenario!

with engine.connect().execution_options(autocommit=True) as conn:
    df.to_sql('mytable', con=conn, method='multi', if_exists='replace', index=True)

Upvotes: 0

user19498404

Reputation: 31

Thanks for all the input (still new here)! I accidentally stumbled upon the solution, which was reducing the chunksize in df.to_sql from

df.to_sql(chunksize=1000)

to

df.to_sql(chunksize=200)

After digging, it turns out there's a limitation on the SQL Server side: a single statement accepts at most 2,100 parameters, and with method='multi' every row in a chunk contributes one parameter per column (https://discuss.dizzycoding.com/to_sql-pyodbc-count-field-incorrect-or-syntax-error/).
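For anyone hitting the same limit, here's a rough sketch (not part of my original code) of deriving a chunksize from the column count so each multi-row INSERT stays under the 2,100-parameter cap:

# Sketch: keep each multi-row INSERT under SQL Server's 2,100-parameter limit.
# With method='multi', every row in a chunk contributes one parameter per column.
max_params = 2100
safe_chunksize = max(1, max_params // len(df.columns) - 1)  # small headroom

df.to_sql(<table name>, engine, index=False,
          if_exists='replace', schema='dbo',
          chunksize=safe_chunksize, method='multi')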

Upvotes: 2
