David Erickson

Reputation: 16683

Large Dask/Pandas DataFrame (27M rows x 52 columns) .to_csv or .to_sql MemoryError

Long story short, I've been struggling with memory issues (as well as time to execute) using dask/pandas to write a large dataframe with to_sql or to_csv. Can anyone point me in the right direction (more detail below)?

I have appended two large .csv files together to form an even larger dataset (27M rows x 52 columns) with dask in order to save memory, instead of using pandas, where I was running out of memory or close to it.

From there, I am simply trying to write to a .csv, but I am again running out of memory when I use .compute() to transform the dask dataframe into a pandas dataframe and write the file. I would like to focus this question on how I can simply get this data into a .csv without running out of memory. From there, I plan to use SQL Server's import wizard to import the file into a new table in a database. See the code below for how I am reading and concatenating as a dask dataframe, transforming to a pandas dataframe, and then attempting to write to a .csv before running out of memory:

import dask.dataframe as dd

# Read both files lazily with dask, keeping the phone-number columns as strings
df1 = dd.read_csv('C:/Users/david.erickson/file1.csv', dtype={'Phone #': str, 'Called #': str})
df2 = dd.read_csv('C:/Users/david.erickson/file2.csv', dtype={'Phone #': str, 'Called #': str})
df_combined = dd.concat([df1, df2], axis=0).drop_duplicates()
df_combined = df_combined.compute()  # materializes the full 27M-row frame in pandas
df_combined.to_csv('C:/Users/david.erickson/file_combined.csv', index=False)

I am running into memory errors at either the .compute() call or the .to_csv() call, after watching Python approach 38 GB of committed physical memory:

MemoryError: Unable to allocate 210. MiB for an array with shape (27571126,) and data type int64

Before anyone mentions it, I have also tirelessly tried to_sql with no success, as the ultimate goal is to get this data into a SQL Server database. It keeps taking forever, and I am running into memory errors with sqlalchemy's to_sql. Even better, I wish I could write directly to SQL Server, and this is what I tried:

import sqlalchemy as sa

# cnxn_string is my SQL Server connection string
engine = sa.create_engine(cnxn_string, fast_executemany=True)
connection = engine.connect()
df_combined.to_sql('New_Table', con=engine, if_exists='replace', index=False)

I also tried:

df_combined.to_sql('New_Table', con=engine, if_exists='replace', index=False, chunksize=40, method='multi')

There is a limit of 2098 parameters per statement, which is why chunksize=40 (40 rows x 52 columns = 2080 parameters, just under the limit). Anyway, this was slower than only passing fast_executemany=True to create_engine.

Unfortunately, the attempts mentioned above, as well as some other python/sqlalchemy approaches to .to_sql that I researched, either ran out of memory or took forever (overnight), and I had to kill them.

Of the 52 columns and 27.5 million rows, I would guess that the average string size per cell is ~20 characters (at least for the text columns). There has to be something I can do differently to get this data into a SQL Server database, but I am really struggling; I have never had issues with to_sql or to_csv before, and this project has much more data.

Processor and memory specs: (screenshots omitted)

Upvotes: 1

Views: 1410

Answers (1)

Repr

Reputation: 201

Assuming the CSVs have something like a primary key, split the dataset. If you have 52 columns, split it into 4 frames of 14 columns each (13 data columns per frame, plus a duplicated primary key column so you can match them back up) and then pass the smaller frames to SQL; see the sketch below.
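
A minimal sketch of the column-wise split, assuming pandas, the cnxn_string and file paths from your question, and a hypothetical key column named 'ID' (swap in whatever your real key is):

import pandas as pd
import sqlalchemy as sa

engine = sa.create_engine(cnxn_string, fast_executemany=True)
key = 'ID'  # hypothetical primary-key column; use your real one

# Read just the header to get the column names cheaply.
all_cols = pd.read_csv('C:/Users/david.erickson/file1.csv', nrows=0).columns
other_cols = [c for c in all_cols if c != key]

# Groups of up to 13 data columns, each carrying the key so the pieces
# can be joined back together in SQL.
groups = [other_cols[i:i + 13] for i in range(0, len(other_cols), 13)]

for n, cols in enumerate(groups, start=1):
    part = pd.read_csv('C:/Users/david.erickson/file1.csv',
                       usecols=[key] + cols, dtype=str)
    part.to_sql(f'New_Table_part{n}', con=engine, if_exists='replace',
                index=False, chunksize=1000)
    # repeat with file2.csv and if_exists='append' to load the second file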

Or make a list of primary keys, only insert the top N rows, commit the transaction, and then start on the next batch.
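
One way to get that batching effect without building an explicit key list is pandas' chunked CSV reader; a rough sketch, again assuming the cnxn_string and file paths from your question:

import pandas as pd
import sqlalchemy as sa

engine = sa.create_engine(cnxn_string, fast_executemany=True)
batch_size = 100_000  # the "top N" per transaction; tune to your memory

# Stream the file N rows at a time; each to_sql call runs in its own
# transaction, so memory use stays bounded by one batch.
reader = pd.read_csv('C:/Users/david.erickson/file1.csv', dtype=str,
                     chunksize=batch_size)
for i, batch in enumerate(reader):
    batch.to_sql('New_Table', con=engine,
                 if_exists='replace' if i == 0 else 'append',
                 index=False)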

Also, MS SQL has an import function for .csv (BULK INSERT) that tends to be fairly efficient since it batches its inserts. Why use Python as a layer between the two?
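
If you do want to kick that off from python anyway, a minimal sketch running BULK INSERT through a sqlalchemy engine, assuming the SQL Server machine can see that file path and New_Table already exists with matching columns:

import sqlalchemy as sa

engine = sa.create_engine(cnxn_string, fast_executemany=True)

# BULK INSERT runs on the server, so the path below is resolved on the
# SQL Server box, not the client.
bulk = sa.text("""
    BULK INSERT New_Table
    FROM 'C:\\Users\\david.erickson\\file_combined.csv'
    WITH (FIRSTROW = 2, FIELDTERMINATOR = ',', ROWTERMINATOR = '\\n');
""")

with engine.begin() as conn:
    conn.execute(bulk)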

Upvotes: 1
