Reputation: 53
Following this link: New column in pandas dataframe based on existing column values
I have a data frame with a column called 'Country' that lists several countries in the world. I need to create another column with a region specifier like 'Europe'. I have three lists of countries that belong to several regions, so if the state in df ['Country'] matches a state in the 'Europe' list, the 'Europe' specifier is inserted in the new column df[' Region'].
my data is: https://sendeyo.com/up/d/2acd2eb849
The problem is that when I use the solutions expressed in the previous link, they work for the sample dataframes, but not for my database. my dataframe like this:
Year Country Population GDP
1870 Austria 4,520 8,419
1870 Belgium 5,096 13,716
1870 Denmark 1,888 3,782
1870 Finland 1,754 1,999
1870 France 38,440 72,100
My lists:
Europa = ["Austria", "Belgium", "Denmark"]
RamasOccidentales = ["Australia","New Zealand","Canada","United States"]
Latinoamerica = ["Brazil","Chile","Uruguay"]
Asia = ["Indonesia","Japan","Sri Lanka"]
The expected result
Year Country Population GDP Region
1870 Austria 4,520 8,419 Europa
1870 Belgium 5,096 13,716 Europa
1870 Denmark 1,888 3,782 Europa
1870 Finland 1,754 1,999 Europa
1870 France 38,440 72,100 Europa
Here is the code I tried:
def Continent(country):
return "Europa" if country in Europa else "Unknown"
df['Region'] = df['Country'].apply(Continent)
Thanks.
Upvotes: 2
Views: 1059
Reputation: 30930
Use np.select
+ Series.isin
:
Europa = ["Austria", "Belgium", "Denmark",'France','Finland']
RamasOccidentales = ["Australia","New Zealand","Canada","United States"]
Latinoamerica = ["Brazil","Chile","Uruguay"]
Asia = ["Indonesia","Japan","Sri Lanka"]
#using np.select
cond=[df['Country'].isin(Europa),df['Country'].isin(RamasOccidentales),df['Country'].isin(Latinoamerica),df['Country'].isin(Asia)]
values=['Europa','RamasOccidentales','Latinoamerica','Asia']
df['Region']=np.select(cond,values)
print(df)
Year Country Population GDP Region
0 1870 Austria 4,520 8,419 Europa
1 1870 Belgium 5,096 13,716 Europa
2 1870 Denmark 1,888 3,782 Europa
3 1870 Finland 1,754 1,999 Europa
4 1870 France 38,440 72,100 Europa
Also you can use a dict to create list of cond and values.It is faster:
classification_countries={'Europa':Europa,
'RamasOccidentales':RamasOccidentales,
'Latinoamerica':Latinoamerica ,
'Asia':Asia}
dict_cond_values= {key:df['Country'].isin(classification_countries[key]) for key in classification_countries}
df['Region']=np.select(dict_cond_values.values(),dict_cond_values.keys())
print(df)
Year Country Population GDP Region
0 1870 Austria 4,520 8,419 Europa
1 1870 Belgium 5,096 13,716 Europa
2 1870 Denmark 1,888 3,782 Europa
3 1870 Finland 1,754 1,999 Europa
4 1870 France 38,440 72,100 Europa
or
classification_countries={'Europa':Europa,
'RamasOccidentales':RamasOccidentales,
'Latinoamerica':Latinoamerica ,
'Asia':Asia}
cond=[df['Country'].isin(classification_countries[key]) for key in classification_countries]
values=[ key for key in classification_countries]
df['Region']=np.select(cond,values)
print(df)
Year Country Population GDP Region
0 1870 Austria 4,520 8,419 Europa
1 1870 Belgium 5,096 13,716 Europa
2 1870 Denmark 1,888 3,782 Europa
3 1870 Finland 1,754 1,999 Europa
4 1870 France 38,440 72,100 Europa
comparison with the solution of jezrael measuring after the creation of the dictionary until the execution of print (df)
%%timeit
dict_cond_values= {key:df['Country'].isin(classification_countries[key]) for key in classification_countries}
df['Region']=np.select(dict_cond_values.values(),dict_cond_values.keys())
print(df)
#5.06 ms ± 215 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
cond=[df['Country'].isin(classification_countries[key]) for key in classification_countries]
values=[ key for key in classification_countries]
df['Region']=np.select(cond,values)
print(df)
#5.18 ms ± 652 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
@jezrael
%%timeit
d1 = {k: oldk for oldk, oldv in d.items() for k in oldv}
df['Region'] = df['Country'].map(d1)
print (df)
#7.88 ms ± 824 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Upvotes: 1
Reputation: 863226
Use Series.map
with dictionary created from lists:
Europa = ["Austria", "Belgium", "Denmark",'France','Finland']
RamasOccidentales = ["Australia","New Zealand","Canada","United States"]
Latinoamerica = ["Brazil","Chile","Uruguay"]
Asia = ["Indonesia","Japan","Sri Lanka"]
d = {'Europa':Europa,'RamasOccidentales':RamasOccidentales,
'Latinoamerica':Latinoamerica,'Asia':Asia}
#swap key values in dict
#http://stackoverflow.com/a/31674731/2901002
d1 = {k: oldk for oldk, oldv in d.items() for k in oldv}
df['Region'] = df['Country'].map(d1)
print (df)
Year Country Population GDP Region
0 1870 Austria 4,520 8,419 Europa
1 1870 Belgium 5,096 13,716 Europa
2 1870 Denmark 1,888 3,782 Europa
3 1870 Finland 1,754 1,999 Europa
4 1870 France 38,440 72,100 Europa
print (d1)
{'Austria': 'Europa', 'Belgium': 'Europa', 'Denmark': 'Europa',
'France': 'Europa', 'Finland': 'Europa',
'Australia': 'RamasOccidentales',
'New Zealand': 'RamasOccidentales',
'Canada': 'RamasOccidentales',
'United States': 'RamasOccidentales',
'Brazil': 'Latinoamerica', 'Chile': 'Latinoamerica',
'Uruguay': 'Latinoamerica', 'Indonesia': 'Asia',
'Japan': 'Asia', 'Sri Lanka': 'Asia'}
Performance is for 10k rows 2.58 times better:
np.random.seed(2019)
Europa = ["Austria", "Belgium", "Denmark",'France','Finland']
RamasOccidentales = ["Australia","New Zealand","Canada","United States"]
Latinoamerica = ["Brazil","Chile","Uruguay"]
Asia = ["Indonesia","Japan","Sri Lanka"]
d = {'Europa':Europa,'RamasOccidentales':RamasOccidentales,
'Latinoamerica':Latinoamerica,'Asia':Asia}
d1 = {k: oldk for oldk, oldv in d.items() for k in oldv}
df = pd.DataFrame({'Country': np.random.choice(list(d1.keys()), size=10000)})
In [280]: %%timeit
...: d1 = {k: oldk for oldk, oldv in d.items() for k in oldv}
...:
...: df['Region'] = df['Country'].map(d1)
...:
3.04 ms ± 43.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [281]: %%timeit
...: classification_countries={'Europa':Europa,
...: 'RamasOccidentales':RamasOccidentales,
...: 'Latinoamerica':Latinoamerica ,
...: 'Asia':Asia}
...:
...: cond=[df['Country'].isin(classification_countries[key]) for key in classification_countries]
...: values=[ key for key in classification_countries]
...:
...: df['Region']=np.select(cond,values)
...:
7.86 ms ± 56.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [282]: %%timeit
...: cond=[df['Country'].isin(Europa),df['Country'].isin(RamasOccidentales),df['Country'].isin(Latinoamerica),df['Country'].isin(Asia)]
...: values=['Europa','RamasOccidentales','Latinoamerica','Asia']
...: df['Region']=np.select(cond,values)
...:
7.96 ms ± 281 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [293]: %%timeit
...: classification_countries={'Europa':Europa,
...: 'RamasOccidentales':RamasOccidentales,
...: 'Latinoamerica':Latinoamerica ,
...: 'Asia':Asia}
...:
...: dict_cond_values= {key:df['Country'].isin(classification_countries[key]) for key in classification_countries}
...:
...:
...: df['Region']=np.select(dict_cond_values.values(),dict_cond_values.keys())
...:
8.54 ms ± 1.31 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
Upvotes: 3
Reputation: 190
A very similar but alternate approach would be to use a dictionary-based lookup to determine the country. In this implementation, you would create a single dictionary with countries as keys and their corresponding region as the paired values.
region_map = {
'Austria': 'Europa',
'Brazil': 'Latinoamerica',
'Japan': 'Asia' # so on and so forth
}
df['Region'] = df['Country'].apply(lambda c: region_map.get(c, 'Unknown'))
This will produce either the corresponding country from your dictionary map or the string 'Unknown' if no key-value pair exists.
Upvotes: 0