st_9
st_9

Reputation: 55

How to merge Pandas dataframes without duplicating columns

I have data of the form:

frame1 = pd.DataFrame({'supplier1_match0': ['x'], 'id': [1]})
frame2 = pd.DataFrame({'supplier1_match0': ['2x'], 'id': [2]})

and wish to left join multiple frames to a frame like this:

base_frame = pd.DataFrame({'id':[1,2,3]})

I merge on the id and get:

merged = base_frame.merge(frame1, how='left', left_on='id', right_on='id')
merged = merged.merge(frame2, how='left', left_on='id', right_on='id')

   id supplier1_match0_x supplier1_match0_y
0   1                  x                NaN
1   2                NaN                 2x
2   3                NaN                NaN

The column is duplicated and a 'y' is appended. Here is what I need:

id, supplier1_match0, ...
1,  x
2,  2x
3, NaN

Is there a simple way to achieve this? There is a similar question (Nested dictionary to multiindex dataframe where dictionary keys are column labels) but the data has a different shape. Note that I have multiple suppliers and that they have varying numbers of matches, so I can't assume the data will have a "rectangular" shape. Thanks in advance.

Upvotes: 3

Views: 5039

Answers (2)

RuchiMystical
RuchiMystical

Reputation: 1

newdf_merge= pd.merge(pd.DataFrame(df1), pd.DataFrame(df2), left_on=['common column name from df1'],right_on=['common column name from df2'],how='left')

It worked for me , hence wanted to share it here

Upvotes: -1

ALollz
ALollz

Reputation: 59519

Your problem is that you don't really want to just merge everything. You need to concat your first set of frames, then merge.

import pandas as pd
import numpy as np

base_frame.merge(pd.concat([frame1, frame2]), how='left')

#   id supplier1_match0
#0   1                x
#1   2               2x
#2   3              NaN

Alternatively, you could define base_frame so that it has all of the relevant columns of the other frames and set id to be the index and use .update. This ensures base_frame remains the same size, while the above does not. Though data would be over-written if there are multiple non-null values for a given cell.

base_frame = pd.DataFrame({'id':[1,2,3]}).assign(supplier1_match0 = np.NaN).set_index('id')

for df in [frame1, frame2]:
    base_frame.update(df.set_index('id'))

print(base_frame)

   supplier1_match0
id                 
1                 x
2                2x
3               NaN

Upvotes: 3

Related Questions