Reputation: 7913
I guess this question needs some insight into the implementation of concat.
Say I have 30 files, 1 G each, and I can only use up to 32 G of memory. I loaded the files into a list of DataFrames called 'list_of_pieces'. This list_of_pieces should be ~30 G in size, right?
If I do pd.concat(list_of_pieces), does concat allocate another 30 G (or maybe 10 G or 15 G) on the heap and do some operations there, or does it run the concatenation 'in place' without allocating new memory?
Does anyone know?
Thanks!
Upvotes: 19
Views: 20761
Reputation: 21
Try this:
import pandas as pd

dfs = [df1, df2]
# copy=False skips pandas' defensive copy, but concat still allocates a new result
temp = pd.concat(dfs, copy=False, ignore_index=False)
# drop the originals and rebind df1 so only the concatenated frame stays in memory
del dfs, df2
df1 = temp
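Note this still isn't truly in-place: concat materializes the full result, so for a moment you need room for both the inputs and the result; dropping the input references right after just avoids holding both copies alive afterwards.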
Upvotes: 2
Reputation: 129018
The answer is no; this is not an in-place operation. np.concatenate is used under the hood, see here: Concatenate Numpy arrays without copying
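A quick way to see this for yourself (a minimal sketch of my own, not from pandas' docs):

import numpy as np
import pandas as pd

df1 = pd.DataFrame({"a": np.arange(3)})
df2 = pd.DataFrame({"a": np.arange(3)})
result = pd.concat([df1, df2])

# the result is backed by freshly allocated arrays, not views of the
# inputs, so peak memory is roughly inputs + result
print(np.shares_memory(result["a"].to_numpy(), df1["a"].to_numpy()))  # False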
A better approach to the problem is to write each of these pieces to an HDFStore table; see http://pandas.pydata.org/pandas-docs/dev/io.html#hdf5-pytables for the docs and http://pandas.pydata.org/pandas-docs/dev/cookbook.html#hdfstore for some recipes.
Then you can select whatever portions (or even the whole set) as needed, by query or even by row number.
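For example (a rough sketch: the file name, the key "df", and file_paths are placeholders of mine, and it assumes PyTables is installed):

import pandas as pd

# append each piece to one on-disk table instead of holding all of them in RAM
with pd.HDFStore("pieces.h5") as store:
    for path in file_paths:  # file_paths: your 30 source files (placeholder name)
        piece = pd.read_csv(path)
        store.append("df", piece)  # table format supports appends and queries

# later, pull back only what you need, by row number or by query
with pd.HDFStore("pieces.h5") as store:
    head = store.select("df", start=0, stop=1000)      # first 1000 rows
    subset = store.select("df", where="index < 1000")  # query on the index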
Certain types of operations can even be done while the data is on disk; see https://github.com/pydata/pandas/issues/3202?source=cc and http://pytables.github.io/usersguide/libref/expr_class.html#
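For instance, a full pass over the table can be streamed in chunks, so the whole 30 G never has to be resident at once (a sketch, same placeholder file and key as above; some_column is hypothetical):

import pandas as pd

total = 0
with pd.HDFStore("pieces.h5") as store:
    # iterate the on-disk table in blocks instead of loading it whole
    for chunk in store.select("df", chunksize=500000):
        total += chunk["some_column"].sum()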
Upvotes: 16