Vibhu
Vibhu

Reputation: 361

How to categories rows of a dataframe based on columns on another dataframe?

I am building a dataframe df like this

df = pd.DataFrame({
    'South East Asia': ['India', 'Sri Lanka', 'Mali', 'Maldives'],
    'Europe': ['France', 'Italy', 'Germany', 'Spain'],
    'North America': ['Mexico', 'US', 'Canada', np.nan]},
    columns=['South East Asia', 'Europe', 'North America'])

Which looks like this:

  South East Asia   Europe North America
0           India   France        Mexico
1       Sri Lanka    Italy            US
2            Mali  Germany        Canada
3        Maldives    Spain           NaN

I have a customer data (~40K rows) in another dataframe df2 with a column Country with countries from all the 11 countries listed above.

I want to make a new column in df2 called Continent which would have the corresponding continent as mentioned in my dataframe df. So my desired output for

print(df2.loc[:, 'Country':'Continent'])

would be

    Country    Continent
0   France        Europe
1    Italy        Europe
2   Mexico    North America
...

Note: While looking up this query online, I found that there are some ready-made tools to categories countries into continents, which is understandable since this must be required frequently within the community. But I want to learn how to categorise data based on datasets in another dataframes. That's why I'm trying to achieve the same in only these 11 countries and 3 continents.

Thanks in advance!

Upvotes: 2

Views: 654

Answers (2)

jpp
jpp

Reputation: 164703

Using pandas.melt, you can create a series mapping country to continent:

s = pd.melt(df).set_index('value')['variable']

df2['Continent'] = df2['Country'].map(s).fillna('DefaultContinent')

For NaN, you may want to take extra care. For example:

mapper = pd.melt(df).dropna(subset=['value']).set_index('value')['variable']

Upvotes: 1

jezrael
jezrael

Reputation: 862851

Use dict comprehension with flattening for dictionary:

s = {k:v for a, b in df.items() for k, v in dict.fromkeys(b, a).items()}
#if want remove NaNs
#s = {k:v for a, b in df.items() for k, v in dict.fromkeys(b, a).items() if pd.notnull(k)}

Or create Series with stack:

s = df.stack().reset_index().set_index(0)['level_1']
print (s)
India        South East Asia
France                Europe
Mexico         North America
Sri Lanka    South East Asia
Italy                 Europe
US             North America
Mali         South East Asia
Germany               Europe
Canada         North America
Maldives     South East Asia
Spain                 Europe
Name: level_1, dtype: object

and last map:

df2['Continent'] = df2['Country'].map(s)

Upvotes: 3

Related Questions