Gary Becker
Gary Becker

Reputation: 53

New column in pandas dataframe based on existing column values with conditions list

Following this link: New column in pandas dataframe based on existing column values

I have a data frame with a column called 'Country' that lists several countries in the world. I need to create another column with a region specifier like 'Europe'. I have three lists of countries that belong to several regions, so if the state in df ['Country'] matches a state in the 'Europe' list, the 'Europe' specifier is inserted in the new column df[' Region'].

my data is: https://sendeyo.com/up/d/2acd2eb849

The problem is that when I use the solutions expressed in the previous link, they work for the sample dataframes, but not for my database. my dataframe like this:

Year    Country Population  GDP 
1870    Austria  4,520     8,419    
1870    Belgium  5,096     13,716   
1870    Denmark  1,888     3,782    
1870    Finland  1,754     1,999    
1870    France   38,440    72,100   

My lists:

Europa = ["Austria", "Belgium", "Denmark"]

RamasOccidentales = ["Australia","New Zealand","Canada","United States"]

Latinoamerica = ["Brazil","Chile","Uruguay"]

Asia = ["Indonesia","Japan","Sri Lanka"]

The expected result

Year    Country Population  GDP Region
1870    Austria 4,520   8,419   Europa 
1870    Belgium 5,096   13,716  Europa 
1870    Denmark 1,888   3,782   Europa 
1870    Finland 1,754   1,999   Europa 
1870    France  38,440  72,100  Europa 

Here is the code I tried:

def Continent(country):
    return "Europa" if country in Europa else "Unknown"

df['Region'] = df['Country'].apply(Continent)

Thanks.

Upvotes: 2

Views: 1059

Answers (3)

ansev
ansev

Reputation: 30930

Use np.select + Series.isin:

Europa = ["Austria", "Belgium", "Denmark",'France','Finland']

RamasOccidentales = ["Australia","New Zealand","Canada","United States"]

Latinoamerica = ["Brazil","Chile","Uruguay"]

Asia = ["Indonesia","Japan","Sri Lanka"]


#using np.select
cond=[df['Country'].isin(Europa),df['Country'].isin(RamasOccidentales),df['Country'].isin(Latinoamerica),df['Country'].isin(Asia)]
values=['Europa','RamasOccidentales','Latinoamerica','Asia']
df['Region']=np.select(cond,values)

print(df)

   Year  Country Population     GDP  Region
0  1870  Austria      4,520   8,419  Europa
1  1870  Belgium      5,096  13,716  Europa
2  1870  Denmark      1,888   3,782  Europa
3  1870  Finland      1,754   1,999  Europa
4  1870   France     38,440  72,100  Europa

Also you can use a dict to create list of cond and values.It is faster:

classification_countries={'Europa':Europa,
                          'RamasOccidentales':RamasOccidentales,
                          'Latinoamerica':Latinoamerica ,
                          'Asia':Asia}

dict_cond_values= {key:df['Country'].isin(classification_countries[key]) for key in classification_countries}


df['Region']=np.select(dict_cond_values.values(),dict_cond_values.keys())
print(df)
   Year  Country Population     GDP  Region
0  1870  Austria      4,520   8,419  Europa
1  1870  Belgium      5,096  13,716  Europa
2  1870  Denmark      1,888   3,782  Europa
3  1870  Finland      1,754   1,999  Europa
4  1870   France     38,440  72,100  Europa

or

classification_countries={'Europa':Europa,
                          'RamasOccidentales':RamasOccidentales,
                          'Latinoamerica':Latinoamerica ,
                          'Asia':Asia}

cond=[df['Country'].isin(classification_countries[key]) for key in classification_countries]
values=[ key for key in classification_countries]

df['Region']=np.select(cond,values)
print(df)

   Year  Country Population     GDP  Region
0  1870  Austria      4,520   8,419  Europa
1  1870  Belgium      5,096  13,716  Europa
2  1870  Denmark      1,888   3,782  Europa
3  1870  Finland      1,754   1,999  Europa
4  1870   France     38,440  72,100  Europa

comparison with the solution of jezrael measuring after the creation of the dictionary until the execution of print (df)

%%timeit
dict_cond_values= {key:df['Country'].isin(classification_countries[key]) for key in classification_countries}   
df['Region']=np.select(dict_cond_values.values(),dict_cond_values.keys())
print(df)
#5.06 ms ± 215 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
cond=[df['Country'].isin(classification_countries[key]) for key in classification_countries]
values=[ key for key in classification_countries]

df['Region']=np.select(cond,values)
print(df)
#5.18 ms ± 652 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

@jezrael

%%timeit

d1 = {k: oldk for oldk, oldv in d.items() for k in oldv}

df['Region'] = df['Country'].map(d1)

print (df)
#7.88 ms ± 824 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Upvotes: 1

jezrael
jezrael

Reputation: 863226

Use Series.map with dictionary created from lists:

Europa = ["Austria", "Belgium", "Denmark",'France','Finland']
RamasOccidentales = ["Australia","New Zealand","Canada","United States"]
Latinoamerica = ["Brazil","Chile","Uruguay"]
Asia = ["Indonesia","Japan","Sri Lanka"]

d = {'Europa':Europa,'RamasOccidentales':RamasOccidentales,
     'Latinoamerica':Latinoamerica,'Asia':Asia}

#swap key values in dict
#http://stackoverflow.com/a/31674731/2901002
d1 = {k: oldk for oldk, oldv in d.items() for k in oldv}

df['Region'] = df['Country'].map(d1)

print (df)
   Year  Country Population     GDP  Region
0  1870  Austria      4,520   8,419  Europa
1  1870  Belgium      5,096  13,716  Europa
2  1870  Denmark      1,888   3,782  Europa
3  1870  Finland      1,754   1,999  Europa
4  1870   France     38,440  72,100  Europa

print (d1)

{'Austria': 'Europa', 'Belgium': 'Europa', 'Denmark': 'Europa', 
 'France': 'Europa', 'Finland': 'Europa', 
 'Australia': 'RamasOccidentales', 
 'New Zealand': 'RamasOccidentales', 
 'Canada': 'RamasOccidentales', 
 'United States': 'RamasOccidentales', 
 'Brazil': 'Latinoamerica', 'Chile': 'Latinoamerica', 
 'Uruguay': 'Latinoamerica', 'Indonesia': 'Asia',
 'Japan': 'Asia', 'Sri Lanka': 'Asia'}

Performance is for 10k rows 2.58 times better:

np.random.seed(2019)

Europa = ["Austria", "Belgium", "Denmark",'France','Finland']
RamasOccidentales = ["Australia","New Zealand","Canada","United States"]
Latinoamerica = ["Brazil","Chile","Uruguay"]
Asia = ["Indonesia","Japan","Sri Lanka"]

d = {'Europa':Europa,'RamasOccidentales':RamasOccidentales,
     'Latinoamerica':Latinoamerica,'Asia':Asia}

d1 = {k: oldk for oldk, oldv in d.items() for k in oldv}
df = pd.DataFrame({'Country': np.random.choice(list(d1.keys()), size=10000)})

In [280]: %%timeit
     ...: d1 = {k: oldk for oldk, oldv in d.items() for k in oldv}
     ...: 
     ...: df['Region'] = df['Country'].map(d1)
     ...: 
3.04 ms ± 43.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [281]: %%timeit
     ...: classification_countries={'Europa':Europa,
     ...:                           'RamasOccidentales':RamasOccidentales,
     ...:                           'Latinoamerica':Latinoamerica ,
     ...:                           'Asia':Asia}
     ...: 
     ...: cond=[df['Country'].isin(classification_countries[key]) for key in classification_countries]
     ...: values=[ key for key in classification_countries]
     ...: 
     ...: df['Region']=np.select(cond,values)
     ...: 
7.86 ms ± 56.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [282]: %%timeit
     ...: cond=[df['Country'].isin(Europa),df['Country'].isin(RamasOccidentales),df['Country'].isin(Latinoamerica),df['Country'].isin(Asia)]
     ...: values=['Europa','RamasOccidentales','Latinoamerica','Asia']
     ...: df['Region']=np.select(cond,values)
     ...: 
7.96 ms ± 281 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [293]: %%timeit
     ...: classification_countries={'Europa':Europa,
     ...:                           'RamasOccidentales':RamasOccidentales,
     ...:                           'Latinoamerica':Latinoamerica ,
     ...:                           'Asia':Asia}
     ...: 
     ...: dict_cond_values= {key:df['Country'].isin(classification_countries[key]) for key in classification_countries}
     ...: 
     ...: 
     ...: df['Region']=np.select(dict_cond_values.values(),dict_cond_values.keys())
     ...: 
8.54 ms ± 1.31 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

Upvotes: 3

Jacob Turpin
Jacob Turpin

Reputation: 190

A very similar but alternate approach would be to use a dictionary-based lookup to determine the country. In this implementation, you would create a single dictionary with countries as keys and their corresponding region as the paired values.

region_map = {
    'Austria': 'Europa',
    'Brazil': 'Latinoamerica',
    'Japan': 'Asia'  # so on and so forth
}
df['Region'] = df['Country'].apply(lambda c: region_map.get(c, 'Unknown'))

This will produce either the corresponding country from your dictionary map or the string 'Unknown' if no key-value pair exists.

Upvotes: 0

Related Questions