Reputation: 568
I am trying to bulk insert a .CSV file into SQL Server without much success.
A bit of background:
1. I needed to insert 16 million records into a SQL Server (2017) DB. Each record has 130 columns. The .CSV file is the result of an API call to one of our vendors, which I am not allowed to mention. The data contains integer, float and string types.
2. I tried the usual: BULK INSERT, but I could not get past the data type errors. I posted a question here but could not make it work.
3. I tried experimenting with Python and tried every method I could find except pandas.to_sql, since everybody warned it was very slow. I got stuck with data type and string truncation errors, different from the ones BULK INSERT raised.
4. Running out of options, I tried pd.to_sql, and while it did not raise any data type or truncation errors, it failed because it ran out of space in my tempdb database. I could not get past this error either, although I had plenty of space and all my data files (and log files) were set to autogrowth without limit.
I got stuck at that point. My code (for the pd.to_sql piece) was simple:
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("mssql+pyodbc://@myDSN")

# df is the DataFrame already loaded from the .CSV file
df.to_sql('myTable', engine, schema='dbo', if_exists='append', index=False, chunksize=100)
I am not really sure what else to try; any word of advice is welcome. All the code and examples I have seen deal with small datasets (not many columns). I am willing to try any other method, and I would appreciate any pointers.
Thanks!
Upvotes: 3
Views: 728
Reputation: 20322
I think df.to_sql is pretty awesome! I have been using it a lot lately. It's a bit slow when data sets are really huge. If you need speed, I think BULK INSERT will be the fastest option. You can even do the job in batches, so you don't run out of memory and perhaps overwhelm your machine.
BEGIN TRANSACTION
BEGIN TRY
    BULK INSERT OurTable
    FROM 'c:\OurTable.txt'
    WITH (CODEPAGE = 'RAW', DATAFILETYPE = 'char', FIELDTERMINATOR = '\t',
          ROWS_PER_BATCH = 10000, TABLOCK)
    COMMIT TRANSACTION
END TRY
BEGIN CATCH
    ROLLBACK TRANSACTION
END CATCH
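Since you are already working through a DSN in Python, here is a minimal sketch of how that same batched BULK INSERT could be issued via pyodbc. The DSN, file path, table name and WITH options are placeholders you would adjust; note that the file path is read by the SQL Server service itself, so the .CSV must be visible from the server machine.

import pyodbc

# Placeholder DSN -- adjust to your environment.
conn = pyodbc.connect("DSN=myDSN", autocommit=False)
cursor = conn.cursor()

# Placeholder table, path and terminators; SQL Server reads the file server-side.
bulk_sql = """
BULK INSERT dbo.myTable
FROM 'c:\\myFile.csv'
WITH (CODEPAGE = 'RAW', DATAFILETYPE = 'char',
      FIELDTERMINATOR = ',', ROWTERMINATOR = '\\n',
      FIRSTROW = 2, ROWS_PER_BATCH = 10000, TABLOCK)
"""

try:
    cursor.execute(bulk_sql)
    conn.commit()          # keep the whole load in one transaction, like the T-SQL above
except pyodbc.Error as exc:
    conn.rollback()        # mirror the ROLLBACK in the CATCH block
    print(f"BULK INSERT failed: {exc}")
finally:
    conn.close()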
Upvotes: 1
Reputation: 95
Loading data from a pandas DataFrame into a SQL database is very slow, and running out of memory is common when dealing with large datasets. You want something much more efficient than that.
d6tstack is something that might solve your problem, because it works with native DB import commands. It is a library built specifically to deal with schema as well as performance issues. It works with XLS, CSV and TXT inputs, which can be exported to CSV, Parquet, SQL and pandas.
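A minimal sketch of what this might look like, assuming the pd_to_mssql helper in d6tstack.utils (listed as experimental in the project's README) and a hypothetical connection string and table name; check the library's documentation for the exact signature:

import pandas as pd
import d6tstack.utils

# Hypothetical connection string and table name -- adjust to your environment.
cfg_uri = 'mssql+pymssql://user:password@myServer/myDatabase'

df = pd.read_csv('myFile.csv')

# d6tstack stages the frame and hands it to the database's native import command,
# which is much faster than row-by-row inserts (helper name assumed from the README).
d6tstack.utils.pd_to_mssql(df, cfg_uri, 'myTable')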
Upvotes: 1
Reputation: 568
I just wanted to share this dirty piece of code in case it helps anybody else. Note that I am very aware it is not optimal at all and that it is slow, but I was able to insert about 16 million records in ten minutes without overloading my machine.
I tried doing it in small batches with:
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("mssql+pyodbc://@myDSN")

# Insert the DataFrame in slices of 1,000 rows at a time
a = 0
b = 1000
while a < len(df):
    try:
        df[a:b].to_sql('myTable', engine, schema='dbo', if_exists='append', index=False, chunksize=100)
    except Exception as e:
        print(f'Error between {a} and {b}: {e}')
    # Move on to the next slice either way, so a bad batch cannot loop forever
    a = b
    b = b + 1000
Ugly as hell, but it worked for me.
I am open to all criticism and advice. As I mentioned, I am posting this in case it helps anybody else, but I am also looking forward to receiving some constructive feedback.
Upvotes: 2