David Erickson

Reputation: 16683

Large Dask/Pandas DataFrame (27M rows x 52 columns) .to_csv or .to_sql MemoryError

Long story short, I've been struggling with memory issues (as well as time to execute) using dask/pandas to write a large dataframe with to_sql or to_csv. Can anyone point me in the right direction (more detail below)?

I have appended two large .csv files together to form an even larger dataset (27M rows x 52 columns) with dask in order to save memory, instead of using pandas, where I was running out of memory or close to it.

From there, I am simply trying to write to a .csv, but I am again running out of memory when I use .compute() to transform the dask dataframe into a pandas dataframe and write the file. I would like to focus this question on how I can simply get this data into a .csv without running out of memory. From there, I plan to use SQL Server's import wizard to import the file into a new table in a database. See the code below for how I am reading and concatenating as a dask dataframe, transforming to a pandas dataframe, and then attempting to write to a .csv before running out of memory:

import dask.dataframe as dd

# Read both files lazily with dask, keeping the phone-number columns as strings
df1 = dd.read_csv('C:/Users/david.erickson/file1.csv', dtype={'Phone #': str, 'Called #': str})
df2 = dd.read_csv('C:/Users/david.erickson/file2.csv', dtype={'Phone #': str, 'Called #': str})
df_combined = dd.concat([df1, df2], axis=0).drop_duplicates()
df_combined = df_combined.compute()  # materializes the full 27M-row frame in pandas
df_combined.to_csv('C:/Users/david.erickson/file_combined.csv', index=False)

I am running into memory errors at either the .compute() call or the .to_csv() call, after watching Python approach 38 GB of committed physical memory:

MemoryError: Unable to allocate 210. MiB for an array with shape (27571126,) and data type int64

Before anyone mentions it, I have also tirelessly tried to_sql with no success, as the ultimate goal is to get this data into a SQL Server database. It keeps taking forever, and I am running into memory errors with sqlalchemy's to_sql. Even better, I wish I could write directly to SQL Server, and this is what I tried:

import sqlalchemy as sa

# cnxn_string is my SQL Server connection string
engine = sa.create_engine(cnxn_string, fast_executemany=True)
connection = engine.connect()
df_combined.to_sql('New_Table', con=engine, if_exists='replace', index=False)

I also tried:

df_combined.to_sql('New_Table', con=engine, if_exists='replace', index=False, chunksize=40, method='multi')

There is a limit of 2098 parameters per statement, which is why chunksize=40 (40 rows x 52 columns = 2080 parameters, just under the limit). Anyway, this was slower than only passing fast_executemany=True to create_engine.

Unfortunately, the attempts mentioned above, as well as some other python/sqlalchemy approaches to .to_sql that I researched, either ran out of memory or took forever (overnight), and I had to kill them.

Of the 52 columns and 27.5 million rows, I would guess that the average string size per cell is ~20 characters (at least for the text columns). There has to be something I can do differently to get this data into a SQL Server database, but I am really struggling; I have never had issues with to_sql or to_csv before, and this project has much more data.

Processor and memory specs: (screenshots omitted)

Upvotes: 1

Views: 1410

Answers (1)

Repr

Reputation: 201

Assuming the CSVs have something like a primary key, split the dataset. If you have 52 columns, split it into 4 frames of 14 columns each (13 data columns per frame, plus a duplicated primary key column so you can match them back up) and then pass the smaller frames to SQL; see the sketch below.
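
A minimal sketch of the column-wise split, assuming pandas, the cnxn_string and file paths from your question, and a hypothetical key column named 'ID' (swap in whatever your real key is):

import pandas as pd
import sqlalchemy as sa

engine = sa.create_engine(cnxn_string, fast_executemany=True)
key = 'ID'  # hypothetical primary-key column; use your real one

# Read just the header to get the column names cheaply.
all_cols = pd.read_csv('C:/Users/david.erickson/file1.csv', nrows=0).columns
other_cols = [c for c in all_cols if c != key]

# Groups of up to 13 data columns, each carrying the key so the pieces
# can be joined back together in SQL.
groups = [other_cols[i:i + 13] for i in range(0, len(other_cols), 13)]

for n, cols in enumerate(groups, start=1):
    part = pd.read_csv('C:/Users/david.erickson/file1.csv',
                       usecols=[key] + cols, dtype=str)
    part.to_sql(f'New_Table_part{n}', con=engine, if_exists='replace',
                index=False, chunksize=1000)
    # repeat with file2.csv and if_exists='append' to load the second file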

Or make a list of primary keys, only insert the top N rows, commit the transaction, and then start on the next batch.
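
One way to get that batching effect without building an explicit key list is pandas' chunked CSV reader; a rough sketch, again assuming the cnxn_string and file paths from your question:

import pandas as pd
import sqlalchemy as sa

engine = sa.create_engine(cnxn_string, fast_executemany=True)
batch_size = 100_000  # the "top N" per transaction; tune to your memory

# Stream the file N rows at a time; each to_sql call runs in its own
# transaction, so memory use stays bounded by one batch.
reader = pd.read_csv('C:/Users/david.erickson/file1.csv', dtype=str,
                     chunksize=batch_size)
for i, batch in enumerate(reader):
    batch.to_sql('New_Table', con=engine,
                 if_exists='replace' if i == 0 else 'append',
                 index=False)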

Also, MS SQL has an import function for .csv (BULK INSERT) that tends to be fairly efficient since it batches its inserts. Why use Python as a layer between the two?
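
If you do want to kick that off from python anyway, a minimal sketch running BULK INSERT through a sqlalchemy engine, assuming the SQL Server machine can see that file path and New_Table already exists with matching columns:

import sqlalchemy as sa

engine = sa.create_engine(cnxn_string, fast_executemany=True)

# BULK INSERT runs on the server, so the path below is resolved on the
# SQL Server box, not the client.
bulk = sa.text("""
    BULK INSERT New_Table
    FROM 'C:\\Users\\david.erickson\\file_combined.csv'
    WITH (FIRSTROW = 2, FIELDTERMINATOR = ',', ROWTERMINATOR = '\\n');
""")

with engine.begin() as conn:
    conn.execute(bulk)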

Upvotes: 1
