DavidS1992

Reputation: 915

Merging many pickle Dataframes into one

I have 600 DataFrames saved as .pickle files and I'd like to merge (or rather append) them into a single DataFrame. Their total size is 10 GB.

When I read each of them, append it to one big DataFrame, and then save the full version to disk, the entire process takes 2 hours on a 16 GB machine.

I think it takes so long because each time I append a new DataFrame, the system allocates new memory for the entire combined DataFrame?
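In outline, the current approach is something like the following sketch (simplified; the file names are illustrative):

import pandas as pd

# Slow pattern: every concat copies the whole accumulated DataFrame
# into freshly allocated memory, so the cost grows with each iteration.
df = pd.DataFrame()
for i in range(600):
    part = pd.read_pickle(f'part_{i}.pickle')  # hypothetical file name
    df = pd.concat([df, part])

df.to_pickle('merged.pickle')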

How can I do this faster?

Upvotes: 2

Views: 3135

Answers (2)

SultanOrazbayev

Reputation: 16571

A small improvement on the accepted answer is to use a generator expression inside concat:

from pandas import read_pickle, concat

df = concat(read_pickle(f) for f in list_of_files)

By dropping the list brackets [...] in favour of a generator expression, we reduce the memory footprint of the operation, since the results no longer need to be held in a list all at once.

Note that the list_of_files variable should contain the list of files, e.g. globbed using pathlib:

from pathlib import Path

list_of_files = Path('.').glob('*.pickle')
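Note that Path.glob returns a lazy iterator and the order of the results is arbitrary. If the parts should be concatenated in a stable order, the paths can be materialized and sorted first (a small sketch):

from pathlib import Path

# Sorting the globbed paths gives a deterministic concatenation order.
list_of_files = sorted(Path('.').glob('*.pickle'))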

Upvotes: 1

Celius Stingher

Reputation: 18377

Rather than appending them one by one, I suggest you use pd.concat() and pass all the DataFrames in one go.

import os
import pandas as pd

output = pd.concat([pd.read_pickle(os.path.join('location', x)) for x in os.listdir('location')])

We create the list of DataFrames with a list comprehension, assuming the pickle files are all saved in the same folder, and use pd.concat to concatenate them into a single DataFrame.
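If the folder also contains non-pickle files, filtering by extension avoids read errors; and when the original row indices don't matter, pd.concat's ignore_index=True gives the result a fresh RangeIndex. A sketch building on the line above (the 'location' folder name is taken from that snippet):

import os
import pandas as pd

folder = 'location'
# Read only the .pickle files and rebuild a clean RangeIndex on the result.
files = [f for f in os.listdir(folder) if f.endswith('.pickle')]
output = pd.concat(
    (pd.read_pickle(os.path.join(folder, f)) for f in files),
    ignore_index=True,
)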

Upvotes: 7
