Reputation: 311
I'm guessing this is an easy fix, but I'm running into an issue where it takes nearly an hour to save a pandas dataframe to a CSV file using the to_csv() function. I'm using Anaconda Python 2.7.12 with pandas 0.19.1.
import os
import glob
import pandas as pd

src_files = glob.glob(os.path.join('/my/path', "*.csv.gz"))

# 1 - Takes 2 min to read 20m records from 30 files
stage = pd.DataFrame()
for file_ in sorted(src_files):
    iter_csv = pd.read_csv(file_
                           , sep=','
                           , index_col=False
                           , header=0
                           , low_memory=False
                           , iterator=True
                           , chunksize=100000
                           , compression='gzip'
                           , memory_map=True
                           , encoding='utf-8')
    df = pd.concat([chunk for chunk in iter_csv])
    stage = stage.append(df, ignore_index=True)

# 2 - Takes 55 min to write 20m records from one dataframe
stage.to_csv('output.csv'
             , sep='|'
             , header=True
             , index=False
             , chunksize=100000
             , encoding='utf-8')
del stage
I've confirmed the hardware and memory are working, but these are fairly wide tables (~ 100 columns) of mostly numeric (decimal) data.
Thank you,
Upvotes: 26
Views: 51164
Reputation: 11209
Opening the file with a 1 MB buffer makes a significant difference when saving to a shared folder.
import timeit
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': range(1000000)})
df['A'] = df.loc[:, 'A'] * 2 * np.pi
file_name = r'\\shared_folder_path\blah.csv'
print(timeit.timeit(lambda: df.to_csv(file_name), number=1))  # plain to_csv
print(timeit.timeit(lambda: df.to_csv(file_name, chunksize=1000000), number=1))  # chunked write
print(timeit.timeit(lambda: df.to_csv(open(file_name, 'wb', 1000000)), number=1))  # 1 MB buffered file handle
print(timeit.timeit(lambda: df.to_csv(open(file_name, 'wb', 1000000), chunksize=1000000), number=1))  # buffered handle + chunked write
Output:
59.76983120001387
61.62541880001663
6.958319600002142
9.22059939999599
Using a buffer helps, while the chunksize parameter is detrimental.
When changing the file path to the local disk, we get the following output:
2.2724577999906614
2.2463568999955896
2.1668612000066787
2.2025332000048365
Here the impact of buffering is insignificant, while the chunksize parameter is again counterproductive.
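For the question's stage dataframe, the same trick would look roughly like this (a minimal sketch, assuming Python 3 and a recent pandas, with the handle opened in text mode and a 1 MB buffer):
# Open the destination with a 1 MB buffer and pass the handle to to_csv()
with open('output.csv', 'w', buffering=1000000, encoding='utf-8', newline='') as f:
    stage.to_csv(f, sep='|', header=True, index=False)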
Upvotes: 0
Reputation: 2529
Adding my small insight, since the 'gzip' alternative did not work for me: try the to_hdf method. This reduced the write time significantly (less than a second for a 100 MB file, whereas the CSV option took between 30 and 55 seconds).
stage.to_hdf(r'path/file.h5', key='stage', mode='w')
It is possible to save different data under different key names, so whatever key name we choose when saving must also be used when reading the data back.
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_hdf.html
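Reading the data back then uses the same key, for example (both to_hdf and read_hdf require the PyTables package):
import pandas as pd

# Load the dataframe back from the HDF5 store using the key it was written with
stage = pd.read_hdf(r'path/file.h5', key='stage')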
Upvotes: 17
Reputation: 1258
Try either Apache's Parquet file format or the polars package, which is an alternative to the usual pandas.
I was trying to cache some data locally from my server; it has 59 million rows in 9 columns. pandas.DataFrame.to_csv simply died, so it couldn't be timed.
I put a breakpoint on the way out, saved the data down as parquet, and read it back into a polars dataframe (the reading wasn't timed, but it was roughly 5-10 seconds):
[ins] In [6]: import polars as pl
[ins] In []: pf = pl.read_parquet('path_to_my_data.parquet')
I wrote this huge dataframe to CSV using polars:
[ins] In [8]: %timeit pf.write_csv('path_to_my_data.csv')
24.3 s ± 5.79 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
I cast the polars dataframe to a pandas one and wrote it out using both hdf and parquet:
[ins] In [9]: df = pf.to_pandas()
[ins] In [11]: %timeit df.to_parquet('path_to_data2.parquet')
11.7 s ± 138 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
[ins] In [12]: %timeit df.to_hdf('path_to_my_data.h5', key="stage", mode="w")
15.4 s ± 723 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The parquet file was 1.8 GB whereas the h5 file was 4.3 GB. to_parquet from pandas applies compression (snappy, gzip, or brotli), but as end users we don't need to decompress it ourselves.
Either of them can be a promising, if not superior, alternative if you need to deal with a huge amount of data and querying it back and forth is a must.
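Applied to the question's stage dataframe, the swap would look roughly like this (a minimal sketch; to_parquet needs the pyarrow or fastparquet engine installed, and 'output.parquet' is just a placeholder name):
# Write the wide dataframe to Parquet instead of CSV
stage.to_parquet('output.parquet', compression='snappy')

# Read it back later when needed
stage = pd.read_parquet('output.parquet')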
Upvotes: 0
Reputation: 1
I used to use to_csv() to write to a company network drive, which was too slow and took an hour to output a 1 GB CSV file. I just tried writing to my laptop's C: drive with the same to_csv() statement, and it took only 2 minutes to output the 1 GB CSV file.
Upvotes: 0
Reputation: 381
You said "[...] of mostly numeric (decimal) data." Do you have any columns with times and/or dates?
I saved an 8 GB CSV in seconds when it had only numeric/string values, but it took 20 minutes to save a 500 MB CSV with two date columns. So what I recommend is to convert each date column to a string before saving. The following command is enough:
df['Column'] = df['Column'].astype(str)
I hope that this answer helps you.
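If there are several date columns, a small loop over the datetime dtypes does the same thing (a sketch, assuming the columns are stored as datetime64):
# Convert every datetime column to plain strings before writing the CSV
for col in df.select_dtypes(include=['datetime64[ns]']).columns:
    df[col] = df[col].astype(str)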
P.S.: I understand that saving as an .hdf file solved the problem. But sometimes we do need a .csv file anyway.
Upvotes: 12
Reputation: 534
You are reading compressed files and writing a plaintext file, which could be an I/O bottleneck. Writing a compressed file can speed up the write by up to 10x:
stage.to_csv('output.csv.gz'
             , sep='|'
             , header=True
             , index=False
             , chunksize=100000
             , compression='gzip'
             , encoding='utf-8')
Additionally, you could experiment with different chunk sizes and compression methods ('bz2', 'xz').
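For example, switching only the filename suffix and the compression argument (a sketch, assuming a reasonably recent pandas that supports these codecs):
# bz2 and xz trade more CPU time for smaller files than gzip
stage.to_csv('output.csv.bz2', sep='|', header=True, index=False,
             chunksize=100000, compression='bz2', encoding='utf-8')

stage.to_csv('output.csv.xz', sep='|', header=True, index=False,
             chunksize=100000, compression='xz', encoding='utf-8')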
Upvotes: 14