Pythonista anonymous
Pythonista anonymous

Reputation: 8970

Pandas merge: how to return the column on which you have done the merge?

I have a dataframe with some categorical data. I want to create a new column which shows only some of those values, and converts the others to 'other'. E.g. to show only the top 10 cities, or, in the example below, show only two colours and convert the others to 'other'.

I want to do it via a pandas.merge, like a SQL outer join: on one hand my table, on the other side a table with only the values I want to keep (ie not convert to 'others').

The problem is, and it took me a bit of debugging and swearing to find that out, that pandas.merge does not return both columns on which you have done the merge, even if one of the columns has nulls.

The solution I have found is to create another column with the same values - which I think would make anyone familiar with SQL cringe. Is there a more elegant way?

This is the code to show what I mean:

import pandas as pd

df=pd.DataFrame()
df['colour']=['yellow','yellow','green','red']

mycols=pd.DataFrame()
mycols['colour']=['yellow','red']

# after this merge, I have no way of knowing which colour in df has no match in mycols
newdf=pd.merge(df, mycols, on='colour', how='outer', suffixes=('','_r'))

# so I need to create another column in mycols
mycols['colour copied']=mycols['colour']
newdf2=pd.merge(df, mycols, on='colour', how='outer', suffixes=('','_r'))
newdf2['colour copied']=newdf2['colour copied'].fillna('other')
newdf2.rename(columns={'colour copied': 'colour - reclassified'})

Upvotes: 2

Views: 667

Answers (1)

jezrael
jezrael

Reputation: 863501

You can add parameter indicator=True for return if matched both, left_only or right_only values:

newdf=pd.merge(df, mycols, on='colour', how='outer', suffixes=('','_r'), indicator=True)
print (newdf)

   colour     _merge
0  yellow       both
1  yellow       both
2   green  left_only
3     red       both

And then set values by condition - here if _merge is both set column colour, else value colour by numpy.where, DataFrame.pop is for extract column:

newdf['colour copied'] = np.where(newdf.pop('_merge') == 'both', newdf['colour'], 'colour')
print (newdf)

   colour colour copied
0  yellow        yellow
1  yellow        yellow
2   green         other
3     red           red

But if working only with one column is possible simplier alternative - compare by Series.isin for test membership:

df['colour copied'] = np.where(df['colour'].isin(mycols['colour']), df['colour'], 'other')
print (df)
   colour colour copied
0  yellow        yellow
1  yellow        yellow
2   green         other
3     red           red

Upvotes: 3

Related Questions