Reputation: 568
I am trying to bulk insert a .CSV file into SQL Server without much success.
A bit of background:
1. I needed to insert 16 million records into a SQL Server (2017) DB. Each record has 130 columns. The .CSV file is the result of an API call to one of our vendors, which I am not allowed to mention. The data contains integer, float and string types.
2. I tried the usual: BULK INSERT, but I could not get past the data type errors. I posted a question here but could not make it work.
3. I tried experimenting with Python and tried every method I could find except pandas.to_sql, since everybody warned it was very slow. I got stuck with data type and string truncation errors, different from the ones BULK INSERT raised.
4. Running out of options, I tried pd.to_sql, and while it did not raise any data type or truncation errors, it failed because it ran out of space in my tempdb database. I could not get past this error either, although I had plenty of space and all my data files (and log files) were set to autogrowth without limit.
I got stuck at that point. My code (for the pd.to_sql piece) was simple:
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("mssql+pyodbc://@myDSN")

# df is the DataFrame already loaded from the .CSV file
df.to_sql('myTable', engine, schema='dbo', if_exists='append', index=False, chunksize=100)
I am not really sure what else to try; any word of advice is welcome. All the code and examples I have seen deal with small datasets (not many columns). I am willing to try any other method, and I would appreciate any pointers.
Thanks!
Upvotes: 3
Views: 728
Reputation: 20322
I think df.to_sql is pretty awesome! I have been using it a lot lately. It's a bit slow when data sets are really huge. If you need speed, I think BULK INSERT will be the fastest option. You can even do the job in batches, so you don't run out of memory and perhaps overwhelm your machine.
BEGIN TRANSACTION
BEGIN TRY
    BULK INSERT OurTable
    FROM 'c:\OurTable.txt'
    WITH (CODEPAGE = 'RAW', DATAFILETYPE = 'char', FIELDTERMINATOR = '\t',
          ROWS_PER_BATCH = 10000, TABLOCK)
    COMMIT TRANSACTION
END TRY
BEGIN CATCH
    ROLLBACK TRANSACTION
END CATCH
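Since you are already working through a DSN in Python, here is a minimal sketch of how that same batched BULK INSERT could be issued via pyodbc. The DSN, file path, table name and WITH options are placeholders you would adjust; note that the file path is read by the SQL Server service itself, so the .CSV must be visible from the server machine.

import pyodbc

# Placeholder DSN -- adjust to your environment.
conn = pyodbc.connect("DSN=myDSN", autocommit=False)
cursor = conn.cursor()

# Placeholder table, path and terminators; SQL Server reads the file server-side.
bulk_sql = """
BULK INSERT dbo.myTable
FROM 'c:\\myFile.csv'
WITH (CODEPAGE = 'RAW', DATAFILETYPE = 'char',
      FIELDTERMINATOR = ',', ROWTERMINATOR = '\\n',
      FIRSTROW = 2, ROWS_PER_BATCH = 10000, TABLOCK)
"""

try:
    cursor.execute(bulk_sql)
    conn.commit()          # keep the whole load in one transaction, like the T-SQL above
except pyodbc.Error as exc:
    conn.rollback()        # mirror the ROLLBACK in the CATCH block
    print(f"BULK INSERT failed: {exc}")
finally:
    conn.close()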
Upvotes: 1
Reputation: 95
Loading data from a pandas DataFrame into a SQL database is very slow, and running out of memory is common when dealing with large datasets. You want something much more efficient than that.
d6tstack is something that might solve your problem, because it works with native DB import commands. It is a library built specifically to deal with schema as well as performance issues. It works with XLS, CSV and TXT inputs, which can be exported to CSV, Parquet, SQL and pandas.
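A minimal sketch of what this might look like, assuming the pd_to_mssql helper in d6tstack.utils (listed as experimental in the project's README) and a hypothetical connection string and table name; check the library's documentation for the exact signature:

import pandas as pd
import d6tstack.utils

# Hypothetical connection string and table name -- adjust to your environment.
cfg_uri = 'mssql+pymssql://user:password@myServer/myDatabase'

df = pd.read_csv('myFile.csv')

# d6tstack stages the frame and hands it to the database's native import command,
# which is much faster than row-by-row inserts (helper name assumed from the README).
d6tstack.utils.pd_to_mssql(df, cfg_uri, 'myTable')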
Upvotes: 1
Reputation: 568
I just wanted to share this dirty piece of code in case it helps anybody else. Note that I am very aware it is not optimal at all and that it is slow, but I was able to insert about 16 million records in ten minutes without overloading my machine.
I tried doing it in small batches with:
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("mssql+pyodbc://@myDSN")

# Insert the DataFrame in slices of 1,000 rows at a time
a = 0
b = 1000
while a < len(df):
    try:
        df[a:b].to_sql('myTable', engine, schema='dbo', if_exists='append', index=False, chunksize=100)
    except Exception as e:
        print(f'Error between {a} and {b}: {e}')
    # Move on to the next slice either way, so a bad batch cannot loop forever
    a = b
    b = b + 1000
Ugly as hell, but it worked for me.
I am open to all criticism and advice. As I mentioned, I am posting this in case it helps anybody else, but I am also looking forward to receiving some constructive feedback.
Upvotes: 2