Reputation: 8970
I have a dataframe with some categorical data. I want to create a new column which shows only some of those values, and converts the others to 'other'. E.g. to show only the top 10 cities, or, in the example below, show only two colours and convert the others to 'other'.
I want to do it via a pandas.merge, like a SQL outer join: on one hand my table, on the other side a table with only the values I want to keep (ie not convert to 'others').
The problem is, and it took me a bit of debugging and swearing to find that out, that pandas.merge does not return both columns on which you have done the merge, even if one of the columns has nulls.
The solution I have found is to create another column with the same values - which I think would make anyone familiar with SQL cringe. Is there a more elegant way?
This is the code to show what I mean:
import pandas as pd
df=pd.DataFrame()
df['colour']=['yellow','yellow','green','red']
mycols=pd.DataFrame()
mycols['colour']=['yellow','red']
# after this merge, I have no way of knowing which colour in df has no match in mycols
newdf=pd.merge(df, mycols, on='colour', how='outer', suffixes=('','_r'))
# so I need to create another column in mycols
mycols['colour copied']=mycols['colour']
newdf2=pd.merge(df, mycols, on='colour', how='outer', suffixes=('','_r'))
newdf2['colour copied']=newdf2['colour copied'].fillna('other')
newdf2.rename(columns={'colour copied': 'colour - reclassified'})
Upvotes: 2
Views: 667
Reputation: 863501
You can add parameter indicator=True
for return if matched both, left_only or right_only values:
newdf=pd.merge(df, mycols, on='colour', how='outer', suffixes=('','_r'), indicator=True)
print (newdf)
colour _merge
0 yellow both
1 yellow both
2 green left_only
3 red both
And then set values by condition - here if _merge
is both
set column colour
, else value colour
by numpy.where
, DataFrame.pop
is for extract column:
newdf['colour copied'] = np.where(newdf.pop('_merge') == 'both', newdf['colour'], 'colour')
print (newdf)
colour colour copied
0 yellow yellow
1 yellow yellow
2 green other
3 red red
But if working only with one column is possible simplier alternative - compare by Series.isin
for test membership:
df['colour copied'] = np.where(df['colour'].isin(mycols['colour']), df['colour'], 'other')
print (df)
colour colour copied
0 yellow yellow
1 yellow yellow
2 green other
3 red red
Upvotes: 3