Reputation: 634
Objective
What would be the best approach to merge df1 and df2 together, where df2 is merged in by chunks? I am currently getting a memory error when merging in df2.
Without chunks I do the following:
df = df1.merge(df2, how='left', left_on=['x','y'], right_on=['x','y'])
Upvotes: 2
Views: 3108
Reputation: 1
The current answer does not work properly, so I am adding some clarification. The dataframe that gets chunked must be the one on the left of the join; if you chunk the right-hand dataframe instead, you will get duplicate rows. Say you want to left-join df2 onto df1 like below:
df = df1.merge(df2, how='left', left_on=['x','y'], right_on=['x','y'])
Then it is df1 that must be chunked. If df2 is chunked instead, the end result contains a lot of duplicate rows: every chunk of df2 is left-merged against all of df1, so the rows of df1 that find no match in a given chunk are emitted again (filled with nulls) for every chunk. I tried the opposite, as proposed by the other answer, and my dataframe went from 7k rows to 24k rows.
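To see the duplication concretely, here is a minimal sketch with toy frames (the values are invented purely for illustration):
import pandas as pd

df1 = pd.DataFrame({'x': [1, 2], 'y': [1, 2], 'a': ['p', 'q']})
df2 = pd.DataFrame({'x': [1, 2], 'y': [1, 2], 'b': ['r', 's']})

# Chunk the RIGHT side of the left merge (the wrong way round):
chunks = [df2[0:1], df2[1:2]]
bad = pd.concat([df1.merge(c, how='left', left_on=['x','y'], right_on=['x','y'])
                 for c in chunks])
print(len(bad))  # 4 rows instead of 2: each chunk re-emits all of df1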
Thus the correct method is:
import pandas as pd

n = 200000  # chunk row size
list_df = [df1[i:i+n] for i in range(0, df1.shape[0], n)]  # chunk the left frame
res = pd.DataFrame()
for chunk in list_df:
    # merge each chunk of df1 against the full df2, then append
    res = pd.concat([res, chunk.merge(df2, how='left', left_on=['x','y'], right_on=['x','y'])])
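On data small enough to hold both frames in memory, the chunked result can be sanity-checked against the plain one-shot merge (a check for testing only, not for the full dataset):
expected = df1.merge(df2, how='left', left_on=['x','y'], right_on=['x','y'])
pd.testing.assert_frame_equal(res.reset_index(drop=True),
                              expected.reset_index(drop=True))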
Upvotes: 0
Reputation: 34086
You can split the large dataframe into chunks of, say, 200K rows.
n = 200000  # chunk row size
list_df = [df2[i:i+n] for i in range(0, df2.shape[0], n)]
Then merge each of the chunked dataframes with df1:
import pandas as pd

res = pd.DataFrame()
for chunk in list_df:
    # merge df1 against each chunk of df2, then append
    res = pd.concat([res, df1.merge(chunk, how='left', left_on=['x','y'], right_on=['x','y'])])
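As a side note, calling pd.concat inside the loop copies the accumulated res on every iteration; a minimal sketch of the same loop that collects the partial merges first and concatenates once at the end:
parts = [df1.merge(chunk, how='left', left_on=['x','y'], right_on=['x','y'])
         for chunk in list_df]
res = pd.concat(parts, ignore_index=True)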
Upvotes: 2