Reputation: 550
I have a problem. I have two dataframes that I would like to merge. When I merge them I get a MemoryError, because the resulting dataframe is many times larger than either input. I found and read the question "pandas merge(how="inner") result is bigger than both dataframes", but I still can't merge them because I run out of memory.
Is there an option to just merge the first element and ignore the other duplicates?
For example, my dataframe df_1 is structured like this (see below).
Is there an option so that df_2 only writes its values in, and any duplicates are simply ignored?
Dataframe df_1
id_x B C id_y new_column
1 4 9 1 a
2 5 8 2 b
3 6 7 3 c
3 6 7 3 z # should be ignored
df_merged = pd.merge(df_1, df_2,
                     how='inner',
                     left_on=['id_x'], right_on=['id_y'],
                     suffixes=['', '_right'])
Upvotes: 0
Views: 1920
Reputation: 41
So it looks like df_2 must have multiple rows for id_y = 3. You could call drop_duplicates on df_2 with a subset of columns, either before the merge or inline within it. Something like:
pd.merge(df_1, df_2.drop_duplicates(subset=['id_y']), how='inner', left_on=['id_x'], right_on=['id_y'], suffixes=['', '_right'])
This makes sure that df_2 has only one row for each id_y (the first occurrence is kept by default), so the join won't fan out.
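As a minimal sketch of the idea, here is the question's example rebuilt as two small frames. The split into df_1 (id_x, B, C) and df_2 (id_y, new_column) is an assumption read off the table in the question, and the validate argument is an optional extra check that was not part of the original suggestion.

```python
import pandas as pd

# Assumed layout, reconstructed from the table in the question.
df_1 = pd.DataFrame({"id_x": [1, 2, 3], "B": [4, 5, 6], "C": [9, 8, 7]})
df_2 = pd.DataFrame({"id_y": [1, 2, 3, 3], "new_column": ["a", "b", "c", "z"]})

# Without deduplication, id 3 matches twice, so the result has an extra row.
fanned_out = pd.merge(df_1, df_2, how="inner", left_on=["id_x"], right_on=["id_y"])

# Keep only the first row per id_y, then merge.
df_2_unique = df_2.drop_duplicates(subset=["id_y"], keep="first")
df_merged = pd.merge(
    df_1,
    df_2_unique,
    how="inner",
    left_on=["id_x"],
    right_on=["id_y"],
    suffixes=("", "_right"),
    validate="many_to_one",  # raises MergeError if id_y is still not unique
)

print(len(fanned_out), len(df_merged))
```

On real data, the duplicated keys multiply on both sides, so it may be worth checking df_1 for repeated id_x values as well before merging.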
Upvotes: 3