Kimi Merroll

Reputation: 311

Pandas to_csv() slow saving large dataframe

I'm guessing this is an easy fix, but I'm running into an issue where it's taking nearly an hour to save a pandas dataframe to a CSV file using the to_csv() function. I'm using Anaconda Python 2.7.12 with pandas 0.19.1.

import os
import glob
import pandas as pd

src_files = glob.glob(os.path.join('/my/path', "*.csv.gz"))

# 1 - Takes 2 min to read 20m records from 30 files
stage = pd.DataFrame()
for file_ in sorted(src_files):
    iter_csv = pd.read_csv(file_
                     , sep=','
                     , index_col=False
                     , header=0
                     , low_memory=False
                     , iterator=True
                     , chunksize=100000
                     , compression='gzip'
                     , memory_map=True
                     , encoding='utf-8')

    df = pd.concat([chunk for chunk in iter_csv])
    stage = stage.append(df, ignore_index=True)

# 2 - Takes 55 min to write 20m records from one dataframe
stage.to_csv('output.csv'
             , sep='|'
             , header=True
             , index=False
             , chunksize=100000
             , encoding='utf-8')

del stage

I've confirmed the hardware and memory are fine, but these are fairly wide tables (~100 columns) of mostly numeric (decimal) data.

Thank you,

Upvotes: 26

Views: 51164

Answers (6)

Tarik

Reputation: 11209

Opening the file with a one-megabyte buffer makes a significant difference when saving to a shared folder.

import timeit
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': range(1000000)})
df['A'] = df.loc[:, 'A'] * 2 * np.pi

file_name = r'\\shared_folder_path\blah.csv'
# 1) default, 2) chunksize only, 3) 1 MB write buffer, 4) buffer + chunksize
print(timeit.timeit(lambda: df.to_csv(file_name), number=1))
print(timeit.timeit(lambda: df.to_csv(file_name, chunksize=1000000), number=1))
print(timeit.timeit(lambda: df.to_csv(open(file_name, 'wb', 1000000)), number=1))
print(timeit.timeit(lambda: df.to_csv(open(file_name, 'wb', 1000000), chunksize=1000000), number=1))

Output:

59.76983120001387
61.62541880001663
6.958319600002142
9.22059939999599

Using a buffer helps, while the chunksize parameter is detrimental.

When changing the file path to the local disk, we get the following output:

2.2724577999906614
2.2463568999955896
2.1668612000066787
2.2025332000048365

The impact of buffering is insignificant, while the chunksize parameter is again counterproductive.

Upvotes: 0

Amir F

Reputation: 2529

Adding my small insight, since the 'gzip' alternative did not work for me: try the to_hdf method. This reduced the write time significantly (less than a second for a 100 MB file, whereas the CSV option took between 30 and 55 seconds).

stage.to_hdf(r'path/file.h5', key='stage', mode='w')

It is possible to save different data under different key names, so whatever key name you choose when saving must also be used when reading the data back.
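Reading it back just means passing the same key; a minimal sketch, reusing the path and key name from the example above:

import pandas as pd

# read the frame back under the same key it was saved with
stage = pd.read_hdf(r'path/file.h5', key='stage')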

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_hdf.html

Upvotes: 17

stucash

Reputation: 1258

Try either Apache's Parquet file format or the polars package, which is an alternative to the usual pandas.

I was trying to cache some data locally from my server; it has 59 million rows across 9 columns. pandas.DataFrame.to_csv simply died and therefore couldn't be timed.

I put a breakpoint on the way out, saved the data down as Parquet, and read it back into a polars dataframe (the read wasn't timed, but it took roughly 5-10 seconds):

[ins] In [6]: import polars as pl
[ins] In []: pf = pl.read_parquet('path_to_my_data.parquet')

I wrote this huge dataframe to CSV using polars:

[ins] In [8]: %timeit pf.write_csv('path_to_my_data.csv')                                                                         
24.3 s ± 5.79 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

I cast the polars dataframe to a pandas one and wrote it out using both HDF and Parquet:

[ins] In [9]: df = pf.to_pandas() 
[ins] In [11]: %timeit df.to_parquet('path_to_data2.parquet')                                                                                       
11.7 s ± 138 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

[ins] In [12]: %timeit df.to_hdf('path_to_my_data.h5', key="stage", mode="w")  
15.4 s ± 723 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The Parquet file was 1.8 GB whereas the h5 file was 4.3 GB. to_parquet from pandas applies compression (snappy, gzip, or brotli), but as end users we don't need to decompress it ourselves.
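If you care which codec is used, pandas lets you pick it explicitly when writing Parquet; a small sketch, assuming pyarrow or fastparquet is installed and reusing the path from the example above:

# 'snappy' is the default codec; 'gzip' and 'brotli' trade write speed for smaller files
df.to_parquet('path_to_data2.parquet', compression='gzip')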

Either of them can be a promising, if not superior, alternative when you need to deal with huge amounts of data and querying it back and forth is a must.

Upvotes: 0

Bo Knows

Reputation: 1

I used to use to_csv() to output to a company network drive, which was too slow and took an hour to write a 1 GB CSV file. I just tried outputting to my laptop's C: drive with the same to_csv() statement, and it only took 2 minutes to write the 1 GB CSV file.
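A hedged workaround when the final destination has to be a network share: write the CSV to a local path first and copy the finished file over in one go (the paths below are placeholders, not from the original post):

import shutil

local_path = r'C:\temp\output.csv'            # fast local write (placeholder path)
network_path = r'\\server\share\output.csv'   # slow network destination (placeholder path)

stage.to_csv(local_path, sep='|', index=False)
shutil.copy(local_path, network_path)         # one sequential copy of the finished file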

Upvotes: 0

lucas F

Reputation: 381

You said "[...] of mostly numeric (decimal) data." Do you have any columns with times and/or dates?

I saved an 8 GB CSV in seconds when it had only numeric/string values, but it took 20 minutes to save a 500 MB CSV with two date columns. So what I would recommend is converting each date column to a string before saving. The following command is enough:

df['Column'] = df['Column'].astype(str) 
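If there are several date columns, a short sketch that converts every datetime-typed column in one pass (assuming they already carry a datetime dtype):

# convert all datetime columns to plain strings before calling to_csv
for col in df.select_dtypes(include=['datetime']).columns:
    df[col] = df[col].astype(str)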

I hope that this answer helps you.

P.S.: I understand that saving as an .hdf file solved the problem, but sometimes we do need a .csv file anyway.

Upvotes: 12

Frane

Reputation: 534

You are reading compressed files and writing a plaintext file; this could be an I/O bottleneck.

Writing a compressed file could speed up writing by up to 10x:

stage.to_csv('output.csv.gz'
             , sep='|'
             , header=True
             , index=False
             , chunksize=100000
             , compression='gzip'
             , encoding='utf-8')

Additionally, you could experiment with different chunk sizes and compression methods ('bz2', 'xz').
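A rough way to compare them is to time each codec on your own dataframe; a minimal sketch (numbers will depend entirely on the data and the disk):

import time

# time to_csv with each compression codec to see the speed/size trade-off
for comp in ['gzip', 'bz2', 'xz']:
    start = time.time()
    stage.to_csv('output.csv.' + comp, sep='|', index=False,
                 chunksize=100000, compression=comp)
    print('{}: {:.1f} s'.format(comp, time.time() - start))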

Upvotes: 14
