oceanbeach96

Reputation: 634

Merging two DataFrames in chunks

Objective

What would be the best approach to merge df1 and df2 together, where df2 is merged in by chunks? I currently get a memory error when merging in df2.

Without chunks I do the following:

df = df1.merge(df2, how='left', left_on=['x','y'], right_on=['x','y'])

Upvotes: 2

Views: 3108

Answers (2)

sunnynm

Reputation: 1

The current answer does not work properly, so I am adding some clarification. The DataFrame that gets chunked must be the left side of the join. If you chunk the other DataFrame instead and join the full left DataFrame against each chunk, you will get duplicate rows. Say you want to left join df2 onto df1 like below:

df = df1.merge(df2, how='left', left_on=['x','y'], right_on=['x','y'])

Then df1 must be chunked. If df2 is chunked instead, the end result will contain a lot of duplicate rows: each chunk of df2 re-emits a full copy of df1, and the rows of df1 that have no match in that particular chunk come back as extra NaN-filled rows. I tried the approach proposed by the other answer and my df went from 7k rows to 24k rows for exactly this reason.
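A minimal sketch that reproduces the blow-up, using made-up toy frames (two rows each, chunk size 1, purely for illustration):

import pandas as pd

df1 = pd.DataFrame({'x': [1, 2], 'y': [1, 2], 'a': ['p', 'q']})
df2 = pd.DataFrame({'x': [1, 2], 'y': [1, 2], 'b': ['r', 's']})

# Chunk df2 (the right side of the left join) into one-row chunks.
chunks = [df2[i:i+1] for i in range(len(df2))]
res = pd.concat([df1.merge(c, how='left', left_on=['x','y'], right_on=['x','y'])
                 for c in chunks])

# res has 4 rows instead of 2: every chunk emits all of df1,
# with NaNs wherever that chunk had no matching keys.
print(len(res))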

Thus the correct method is:

import pandas as pd

n = 200000  # chunk row size
list_df = [df1[i:i+n] for i in range(0, df1.shape[0], n)]

res = pd.DataFrame()

for chunk in list_df:
    res = pd.concat([res, chunk.merge(df2, how='left', left_on=['x','y'], right_on=['x','y'])])
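One caveat about the loop above: calling pd.concat inside the loop re-copies the accumulated result on every iteration. A variant (same merge logic, just a different accumulation strategy) that collects the merged chunks in a list and concatenates once at the end avoids that:

merged_chunks = []
for chunk in list_df:
    merged_chunks.append(chunk.merge(df2, how='left', left_on=['x','y'], right_on=['x','y']))
res = pd.concat(merged_chunks, ignore_index=True)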

Upvotes: 0

Mayank Porwal

Reputation: 34086

You can split the large dataframe into chunks of, say, 200K rows:

import pandas as pd

n = 200000  # chunk row size
list_df = [df2[i:i+n] for i in range(0, df2.shape[0], n)]

Then merge each chunked df with df1:

res = pd.DataFrame()

for chunk in list_df:
    res = pd.concat([res, df1.merge(chunk, how='left', left_on=['x','y'], right_on=['x','y'])])

Upvotes: 2
