noah

Reputation: 2776

Reduce memory usage of pandas concat for lots of dataframes

I have a bunch (15,000+) of small data frames that I need to concatenate column-wise to make one very large (100,000x1000) data frame in pandas. There are two (obvious) concerns I have, speed and memory usage.

The following is one methodology I've seen highly endorsed on Stack Overflow.

dfList = [df1, df2, ..., df15000] #made by appending in a for loop
df_out = pd.concat(dfList, axis=1)

This is great for speed. It's simple code that is easy to understand. However, it uses a fairly large amount of memory. My understanding is that pandas' concat function works by allocating a new, full-size DataFrame and then copying all the data into it, which essentially doubles the amount of memory consumed by the program.
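Here is a minimal sketch of that understanding at a much smaller scale (the names and sizes are made up for illustration): the result of pd.concat is a brand-new frame, so just before the source frames are released the process holds both the parts and the copy.

import numpy as np
import pandas as pd

# Build 100 small single-column frames, then concatenate them column-wise.
parts = [pd.DataFrame(np.random.rand(1000, 1)) for _ in range(100)]
parts_bytes = sum(p.memory_usage(deep=True).sum() for p in parts)

combined = pd.concat(parts, axis=1)
combined_bytes = combined.memory_usage(deep=True).sum()

# Peak usage during the concat is roughly the sum of both figures.
print(parts_bytes, combined_bytes)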

How do I avoid this large memory overhead with minimal reduction in speed?

I tried just adding columns one by one to the first df in a for loop (sketched below). Great for memory (roughly 1 + 1/15,000 of the data held at once), terrible for speed.
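This is roughly what that approach looks like, reusing dfList from the snippet above and assuming the frames have unique column names; each assignment copies only one new column, but inserting thousands of columns into an ever-growing frame is slow.

df_out = dfList[0].copy()
for df in dfList[1:]:
    for name in df.columns:
        # Copy one column at a time into the growing output frame.
        df_out[name] = df[name]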

Then I came up with the following. I replace the list with a deque and do the concatenation piecewise. It saves memory (4.1 GB vs. 5.4 GB on the most recent run) at a manageable speed penalty (under 30 seconds added to a 5-6 minute script), but I can't figure out why this saves memory.

import collections
import pandas as pd

dfDq = collections.deque()
# add all 15,000 dfs to the deque
while len(dfDq) > 2:
    dfDq.appendleft(pd.concat([dfDq.pop(), dfDq.pop(), dfDq.pop()], axis=1))
if len(dfDq) == 2:
    df_out = pd.concat([dfDq.pop(), dfDq.pop()], axis=1)
else:
    df_out = dfDq.pop()

The last step of this piecewise concatenation should still use 2x the memory if my understanding of the pd.concat() function is correct. What is making this work? While the numbers I quoted above for the slowdown and the memory saved are specific to that one run, the general trend has been the same over numerous runs.

In addition to trying to figure out why the above works, I'm also open to other suggestions for methodology.

Upvotes: 5

Views: 6529

Answers (1)

John Zwinck

Reputation: 249093

Just create the full-size DataFrame in advance:

df = pd.DataFrame(index=pd.RangeIndex(0, N), columns=[...])

Then write to it in sections:

col = 0
for path in paths:
    part = pd.read_csv(path)
    # Assign this part's columns into the preallocated frame, in place.
    df.iloc[:, col:col + part.shape[1]] = part
    col += part.shape[1]
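If all the parts are numeric, a variation on the same idea is to preallocate a NumPy array with an explicit dtype and wrap it in a DataFrame at the end, which avoids the object-dtype columns that an empty preallocated frame starts with. A rough sketch, where N, total_cols and paths are placeholders for your own values:

import numpy as np
import pandas as pd

N, total_cols = 100_000, 1000
buf = np.empty((N, total_cols), dtype="float64")

col = 0
names = []
for path in paths:  # hypothetical list of CSV paths
    part = pd.read_csv(path)
    buf[:, col:col + part.shape[1]] = part.to_numpy()
    names.extend(part.columns)
    col += part.shape[1]

df = pd.DataFrame(buf, columns=names)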

Upvotes: 3
