Don M
Don M

Reputation: 986

Pandas Forcing Suffixes in To Dataframes on merge

It is giving me duplicate/suffixed columns. Most questions I am seeing people want a suffix on their data, I am hoping for the opposite.

So far those columns don't seem to be useful and they all seem to end up with NaN from a previous merge.

I want the merge to work by overwritting if it finds a duplicate column or adding the column to the dataframe if it doesn't exist. I was hoping the overwritting would happen based parameters of the merge but it seems to default to keeping all columns and just suffixing.

Example:

data4 = {'col1': [1, 2], 'col2': [3, 4]}
df4 = pd.DataFrame(data=data4)

data5  = {'col1': [1, 2], 'col2': [3, 4], 'col3': [9, 2]}

df5 = pd.DataFrame(data=data5)

df_merged = pd.merge(df4, df5, how="outer", left_on="col1", right_on="col1")
   col1  col2_x  col2_y  col3
0     1       3       3     9
1     2       4       4     2


data6  = {'col1': [1, 2], 'col2': [3, 4], 'col3': [9, 2]}
df6 = pd.DataFrame(data=data6)

pd.merge(df6, df_merged, how="right", left_on="col1", right_on="col1")
   col1  col2  col3_x  col2_x  col2_y  col3_y
0     1     3       9       3       3       9
1     2     4       2       4       4       2

Ideally, in the last merge I would like it to overwrite, the left or right dataframe based on the params, instead of trying to preserve data and adding suffixes.

Upvotes: 0

Views: 940

Answers (2)

Kavitha
Kavitha

Reputation: 310

Merge in pandas is similar to 'joins' in SQL. It will result in common columns being renamed with prefixes to hold the values of the 2 merging dataframes.

For your case, you want to update the dataframe instead.

Upvotes: 0

Henry Yik
Henry Yik

Reputation: 22503

What you need is combine_first:

data4 = {'col1': [1, 2], 'col2': [3, 4]}
df4 = pd.DataFrame(data=data4)

data5  = {'col1': [1, 2], 'col2': [3, 4], 'col3': [9, 2]}

df5 = pd.DataFrame(data=data5)

data6  = {'col1': [1, 2], 'col2': [3, 4], 'col3': [9, 2]}

df6 = pd.DataFrame(data=data6)

print (df4.combine_first(df5).combine_first(df6))

   col1  col2  col3
0     1     3   9.0
1     2     4   2.0

Upvotes: 2

Related Questions