Reputation: 915
I have 600 DataFrames stored as .pickle files and I'd like to merge (or rather append) them into one DataFrame. Their total size is 10 GB.
When I read each of them, append it to one big DataFrame, and then save the full version to disk, the entire process takes 2 hours on a 16 GB machine.
I think it takes so long because each time I append a new DataFrame, the system allocates new memory for the entire combined DataFrame?
How can I do this faster?
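For reference, a simplified sketch of the current loop (the folder name and output path are placeholders, and the per-iteration concat stands in for the append call):
import pandas as pd
from pathlib import Path
big_df = pd.DataFrame()
for f in Path('pickles').glob('*.pickle'):
    # each iteration copies the whole accumulated frame into a new object
    big_df = pd.concat([big_df, pd.read_pickle(f)])
big_df.to_pickle('combined.pickle')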
Upvotes: 2
Views: 3135
Reputation: 16571
A small improvement on the accepted answer is to use a generator expression inside concat:
from pandas import read_pickle, concat
df = concat(read_pickle(f) for f in list_of_files)
By removing the list comprehension [...], we reduce the memory footprint of the operation, since there is no need to hold all of the results of the list comprehension in memory at once.
Note that the list_of_files variable should contain the list of files, e.g. globbed using pathlib:
from pathlib import Path
list_of_files = Path('.').glob('*.pickle')
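One caveat worth noting: Path.glob returns a lazy generator, so list_of_files can only be iterated once; wrapping it in sorted (or list) materialises it and also makes the concatenation order deterministic. A small self-contained sketch of that variant:
from pathlib import Path
from pandas import concat, read_pickle
list_of_files = sorted(Path('.').glob('*.pickle'))
df = concat(read_pickle(f) for f in list_of_files)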
Upvotes: 1
Reputation: 18377
Rather than appending them one by one, I suggest you use pd.concat() and pass all the dataframes in one go.
import os, pandas as pd
Output = pd.concat([pd.read_pickle(os.path.join('location', x)) for x in os.listdir('location')])
We can create the list of dataframes with a list comprehension, assuming the pickle files are all saved in the same folder, and use pd.concat to concatenate them into a single dataframe. This way the result is allocated only once, instead of the growing frame being copied on every append.
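If the individual frames carry overlapping row indexes, an optional tweak (not part of the original answer) is to let pd.concat renumber the combined result:
Output = pd.concat([pd.read_pickle(os.path.join('location', x)) for x in os.listdir('location')], ignore_index=True)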
Upvotes: 7