Test

Reputation: 550

When merging, the data frame becomes much larger

I have two dataframes that I would like to merge. When I merge them, I get a MemoryError because the resulting dataframe is many times larger than either input. I have read the related question "pandas merge(how="inner") result is bigger than both dataframes", but I still can't merge them because I run out of memory.

Is there an option to merge only the first matching row and ignore the other duplicates?

For example, my dataframe df_1 is structured like this (see below). Is there an option so that df_2 only writes its values in once, and any duplicates are simply ignored?

Dataframe df_1

id_x  B   C  id_y new_column
1     4   9  1    a
2     5   8  2    b
3     6   7  3    c
3     6   7  3    z # should be ignored

import pandas as pd

df_merged = pd.merge(df_1,
                     df_2, how='inner',
                     left_on=['id_x'], right_on=['id_y'],
                     suffixes=['', '_right'])

Upvotes: 0

Views: 1920

Answers (1)

Ted

Reputation: 41

It looks like df_2 has multiple rows for id 3, so each matching row in df_1 is repeated once per duplicate. You could call drop_duplicates on df_2 with a subset of columns before the merge (or inline in the merge call). Something like:

pd.merge(df_1, df_2.drop_duplicates(subset=['id_y']), how='inner', left_on=['id_x'], right_on=['id_y'], suffixes=['', '_right'])

This makes sure df_2 has only one row for each id_y, so the join won't fan out.
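To make that concrete, here is a minimal sketch. The sample data is made up for illustration (only the column names come from the question), and the validate argument is an optional extra that is not part of the original suggestion. It asks pandas to raise an error if the right-hand keys are still duplicated, instead of silently fanning out.

```python
import pandas as pd

# Hypothetical sample data; only the column names follow the question.
df_1 = pd.DataFrame({"id_x": [1, 2, 3], "B": [4, 5, 6], "C": [9, 8, 7]})
df_2 = pd.DataFrame({"id_y": [1, 2, 3, 3], "new_column": ["a", "b", "c", "z"]})

# Keep only the first row for each id_y so the join cannot fan out.
df_2_unique = df_2.drop_duplicates(subset=["id_y"], keep="first")

df_merged = pd.merge(
    df_1,
    df_2_unique,
    how="inner",
    left_on=["id_x"],
    right_on=["id_y"],
    suffixes=("", "_right"),
    validate="many_to_one",  # raises MergeError if id_y is still duplicated
)

print(df_merged)
```

Note that keep="first" depends on the row order of df_2, so if a particular duplicate should win, sort df_2 accordingly before dropping duplicates.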

Upvotes: 3
