Reputation: 1632
I have a dataframe:

  state country
0    tx      us
1    ab      ca
2    fl
3
4    qc      ca
5  dawd
I'm trying to create a function that will check if there is a value in the `country` column. If there is NO value in `country`, then check whether the value in `state` is a Canadian or American abbreviation. If it is, assign the correct country name to the `country` column for that row.
For instance, in the sample DF above the function would see that in row 2, `country` is blank. Then it would see that the `state`, `fl`, is part of the US, and it would assign `country` to be `us`.
I'm thinking that this can be done with `pd.apply()`, but I'm having trouble with the execution. I've been playing around with the code below, but I'm doing something wrong...
def country_identifier(country):
    states = ["AK", "AL", "AZ", "AR", "CA", "CO", "CT", "DE", "FL", "GA", "HI", "ID", "IL", "IN", "IA", "KS", "KY",
              "LA", "ME", "MD", "MA", "MI", "MN", "MS", "MO", "MT", "NE", "NV", "NH", "NJ", "NM", "NY", "NC", "ND",
              "OH", "OK", "OR", "PA", "RI", "SC", "SD", "TN", "TX", "UT", "VT", "VA", "WA", "WV", "WI", "WY"]
    provinces = ["ON", "BC", "AB", "MB", "NB", "QC", "NL", "NT", "NS", "PE", "YT", "NU", "SK"]
    if country["country"] not None:
        if country["state"] in states:
            return "us"
        elif country["state"] in provinces:
            return "ca"
        else:
            return country

df2 = df[["country", "state"]].apply(country_identifier)
df2
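For reference, a sketch of how a row-wise `apply` version of this idea might look (with the state/province lists abbreviated here for brevity). The key differences from the attempt above are `pd.isna` for the missing-value check, uppercasing `state` to match the lists, and `axis=1` so each row is passed to the function:

```python
import pandas as pd
import numpy as np

states = ["TX", "FL"]       # abbreviated; use the full list in practice
provinces = ["AB", "QC"]

def country_identifier(row):
    # Only fill in country when it is missing.
    if pd.isna(row["country"]):
        state = str(row["state"]).upper()
        if state in states:
            return "us"
        if state in provinces:
            return "ca"
    # Otherwise keep whatever was already there (including NaN).
    return row["country"]

df = pd.DataFrame({"state": ["tx", "ab", "fl", np.nan, "qc", "dawd"],
                   "country": ["us", "ca", np.nan, np.nan, "ca", np.nan]})
df["country"] = df.apply(country_identifier, axis=1)
```

Note that `apply` with `axis=1` is row-by-row Python, which is why the vectorised answers below are preferable on larger frames.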
Upvotes: 1
Views: 95
Reputation: 13175
You don't need to use nested `np.where` conditions, because nesting puts a hard limit on how many conditions can be checked. Use `df.loc` unless your list of conditions expands quite dramatically; it will be faster than `apply`.
import pandas as pd
import numpy as np

states = ["AK", "AL", "AZ", "AR", "CA", "CO", "CT", "DE", "FL", "GA", "HI", "ID", "IL", "IN", "IA", "KS", "KY",
          "LA", "ME", "MD", "MA", "MI", "MN", "MS", "MO", "MT", "NE", "NV", "NH", "NJ", "NM", "NY", "NC", "ND",
          "OH", "OK", "OR", "PA", "RI", "SC", "SD", "TN", "TX", "UT", "VT", "VA", "WA", "WV", "WI", "WY"]
provinces = ["ON", "BC", "AB", "MB", "NB", "QC", "NL", "NT", "NS", "PE", "YT", "NU", "SK"]

df = pd.DataFrame({'country': {0: 'us', 1: 'ca', 2: np.nan, 3: np.nan, 4: 'ca', 5: np.nan},
                   'state': {0: 'tx', 1: 'ab', 2: 'fl', 3: np.nan, 4: 'qc', 5: 'dawd'}})

df.loc[(df['country'].isnull())
       & (df['state'].str.upper().isin(states)), 'country'] = 'us'
df.loc[(df['country'].isnull())
       & (df['state'].str.upper().isin(provinces)), 'country'] = 'ca'
This is extensible: build a dictionary mapping each country code to its list of abbreviations, then generalise the replacements in a loop.

conditions = {'ca': provinces, 'us': states}
for country, values in conditions.items():
    df.loc[(df['country'].isnull())
           & (df['state'].str.upper().isin(values)), 'country'] = country
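To illustrate the extensibility point, here is a runnable sketch of the same loop with a hypothetical third mapping added (the `mx_states` codes are made up for illustration, and the US/Canada lists are abbreviated); adding a country is just another dictionary entry, with no new code:

```python
import pandas as pd
import numpy as np

states = ["TX", "FL"]          # abbreviated for the example
provinces = ["AB", "QC"]
mx_states = ["JAL", "NLE"]     # hypothetical Mexican state codes

df = pd.DataFrame({"country": [np.nan, np.nan, np.nan],
                   "state": ["fl", "qc", "jal"]})

conditions = {"ca": provinces, "us": states, "mx": mx_states}
for country, values in conditions.items():
    # Fill country only where it is missing and the state matches.
    df.loc[df["country"].isnull()
           & df["state"].str.upper().isin(values), "country"] = country
```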
Upvotes: 2
Reputation: 38415
You can use a nested `np.where`:

df['country'] = np.where(df['state'].str.upper().isin(states), 'us', np.where(df['state'].str.upper().isin(provinces), 'ca', np.nan))

  state country
0    tx      us
1    ab      ca
2    fl      us
3  None     nan
4    qc      ca
Edit: to include the check on `country` first,

cond1 = df.loc[df['country'].isnull(), 'state'].str.upper().isin(states)
cond2 = df.loc[df['country'].isnull(), 'state'].str.upper().isin(provinces)
df.loc[df['country'].isnull(), 'country'] = np.where(cond1, 'us', np.where(cond2, 'ca', np.nan))

  state country
0    tx      us
1    ab      ca
2    fl      us
3   NaN     nan
4    qc      ca
5  dawd     nan
Another way is `np.select`; it is a one-liner and scales well with multiple conditions:

df.loc[df['country'].isnull(), 'country'] = np.select([cond1, cond2], ['us', 'ca'], np.nan)
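A variation on the same idea (a sketch, with the lookup lists abbreviated): build the null check into the conditions themselves and pass the existing column as `np.select`'s `default`, so unmatched rows keep whatever value they already had and no `.loc` mask is needed on the assignment:

```python
import pandas as pd
import numpy as np

states = ["TX", "FL"]       # abbreviated for the example
provinces = ["AB", "QC"]

df = pd.DataFrame({'country': ['us', np.nan, np.nan, np.nan],
                   'state': ['tx', 'fl', 'qc', 'dawd']})

cond_null = df['country'].isnull()
cond_us = cond_null & df['state'].str.upper().isin(states)
cond_ca = cond_null & df['state'].str.upper().isin(provinces)

# Rows matching neither condition fall through to their current value.
df['country'] = np.select([cond_us, cond_ca], ['us', 'ca'],
                          default=df['country'].to_numpy())
```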
Upvotes: 1