Reputation: 147
Is there a way to bin a pandas column of STRINGS into custom groups of custom names. Like similar to the cuts function but for strings.
For example maybe using a list of lists to define what groups are.
grouping_lists = [['Pakistan', 'China', 'Iran'], ['Germany', 'UK', 'Poland'],
['Australia'], ['USA']]
Corresponding to names ['Asia', 'Europe', 'Australia', 'Other']
.
And should something not exist in the list, it get's marked as 'Other'
or something.
Example:
my_id country_name
0 100 Pakistan
1 200 Germany
2 140 Australia
3 400 Germany
4 225 China
5 125 Pakistan
6 600 Poland
7 0 Austria
my_id country_name Groups
0 100 Pakistan Asia
1 200 Germany Europe
2 140 Australia Australia
3 400 Germany Europe
4 225 China Asia
5 125 Pakistan Asia
6 600 Poland Europe
7 0 Austria Other
Upvotes: 1
Views: 2698
Reputation: 670
You can use a lambda if you're not worried about speed.
groups = {
"Asia": ["Pakistan", "China", "Iran"],
"Europe": ["Germany", "UK", "Poland"],
"Australia": ["Australia"],
}
df["Groups"] = (
df["country_names"]
.apply(lambda x: [k for k in groups.keys() if x in groups[k]])
.str[0]
.fillna("Other")
)
Upvotes: 1
Reputation: 21749
Here's a way where you can don't need to create map dict manually (in case it is big):
grouping_lists = [['Pakistan', 'China', 'Iran'], ['Germany', 'UK', 'Poland'],
['Australia'], ['USA']]
names = ['Asia', 'Europe', 'Australia', 'Other']
# create a df with mapping information
maps = (pd.DataFrame({'Groups': names, 'country_name': grouping_lists})
.explode('country_name')
.reset_index(drop=True))
# join maps
df = df.merge(maps, on = 'country_name', how='left').fillna("Other")
my_id country_name Groups
0 100 Pakistan Asia
1 200 Germany Europe
2 140 Australia Australia
3 400 Germany Europe
4 225 China Asia
5 125 Pakistan Asia
6 600 Poland Europe
7 0 Austria Other
Upvotes: 3
Reputation: 14236
Instead of binning your answers you could modify your grouping lists to be a dictionary and then use pandas.Series.map
country_map = {
'Pakistan': 'Asia', 'China': 'Asia',
'Iran': 'Asia', 'Germany': 'Europe',
'UK': 'Europe', 'Poland': 'Europe',
'Australia': 'Australia', 'USA': 'Other'
}
df.assign(Groups=df.country_name.map(country_map)).fillna('Other')
my_id country_name Groups
0 100 Pakistan Asia
1 200 Germany Europe
2 140 Australia Australia
3 400 Germany Europe
4 225 China Asia
5 125 Pakistan Asia
6 600 Poland Europe
7 0 Austria Other
Upvotes: 1