Reputation: 1632
I have a dataframe:

  state country
0    tx      us
1    ab      ca
2    fl
3
4    qc      ca
5  dawd
I'm trying to create a function that will check if there is a value in the `country` column. If there is NO value in `country`, then check whether the value in `state` is a Canadian or American abbreviation. If it is, assign the correct country name to the `country` column for that row.
For instance, in the sample DF above the function would see that in row 2, `country` is blank. Then it would see that the `state`, `fl`, is part of the US, and it would assign `country` to be `us`.
I'm thinking that this can be done with `pd.apply()`, but I'm having trouble with the execution. I've been playing around with the code below, but I'm doing something wrong...
def country_identifier(country):
    states = ["AK", "AL", "AZ", "AR", "CA", "CO", "CT", "DE", "FL", "GA", "HI", "ID", "IL", "IN", "IA", "KS", "KY",
              "LA", "ME", "MD", "MA", "MI", "MN", "MS", "MO", "MT", "NE", "NV", "NH", "NJ", "NM", "NY", "NC", "ND",
              "OH", "OK", "OR", "PA", "RI", "SC", "SD", "TN", "TX", "UT", "VT", "VA", "WA", "WV", "WI", "WY"]
    provinces = ["ON", "BC", "AB", "MB", "NB", "QC", "NL", "NT", "NS", "PE", "YT", "NU", "SK"]
    if country["country"] not None:
        if country["state"] in states:
            return "us"
        elif country["state"] in provinces:
            return "ca"
        else:
            return country

df2 = df[["country", "state"]].apply(country_identifier)
df2
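For reference, a sketch of how a row-wise `apply` version of this idea might look (with the state/province lists abbreviated here for brevity). The key differences from the attempt above are `pd.isna` for the missing-value check, uppercasing `state` to match the lists, and `axis=1` so each row is passed to the function:

```python
import pandas as pd
import numpy as np

states = ["TX", "FL"]       # abbreviated; use the full list in practice
provinces = ["AB", "QC"]

def country_identifier(row):
    # Only fill in country when it is missing.
    if pd.isna(row["country"]):
        state = str(row["state"]).upper()
        if state in states:
            return "us"
        if state in provinces:
            return "ca"
    # Otherwise keep whatever was already there (including NaN).
    return row["country"]

df = pd.DataFrame({"state": ["tx", "ab", "fl", np.nan, "qc", "dawd"],
                   "country": ["us", "ca", np.nan, np.nan, "ca", np.nan]})
df["country"] = df.apply(country_identifier, axis=1)
```

Note that `apply` with `axis=1` is row-by-row Python, which is why the vectorised answers below are preferable on larger frames.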
Upvotes: 1
Views: 95
Reputation: 13175
You don't need to use nested `np.where` conditions, because nesting puts a hard limit on how many conditions can be checked. Use `df.loc` unless your list of conditions expands quite dramatically; it will be faster than `apply`.
import pandas as pd
import numpy as np

states = ["AK", "AL", "AZ", "AR", "CA", "CO", "CT", "DE", "FL", "GA", "HI", "ID", "IL", "IN", "IA", "KS", "KY",
          "LA", "ME", "MD", "MA", "MI", "MN", "MS", "MO", "MT", "NE", "NV", "NH", "NJ", "NM", "NY", "NC", "ND",
          "OH", "OK", "OR", "PA", "RI", "SC", "SD", "TN", "TX", "UT", "VT", "VA", "WA", "WV", "WI", "WY"]
provinces = ["ON", "BC", "AB", "MB", "NB", "QC", "NL", "NT", "NS", "PE", "YT", "NU", "SK"]

df = pd.DataFrame({'country': {0: 'us', 1: 'ca', 2: np.nan, 3: np.nan, 4: 'ca', 5: np.nan},
                   'state': {0: 'tx', 1: 'ab', 2: 'fl', 3: np.nan, 4: 'qc', 5: 'dawd'}})

df.loc[(df['country'].isnull())
       & (df['state'].str.upper().isin(states)), 'country'] = 'us'
df.loc[(df['country'].isnull())
       & (df['state'].str.upper().isin(provinces)), 'country'] = 'ca'
This is extensible: build a dictionary mapping each country code to its list of abbreviations, then generalise the replacements in a loop.

conditions = {'ca': provinces, 'us': states}
for country, values in conditions.items():
    df.loc[(df['country'].isnull())
           & (df['state'].str.upper().isin(values)), 'country'] = country
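To illustrate the extensibility point, here is a runnable sketch of the same loop with a hypothetical third mapping added (the `mx_states` codes are made up for illustration, and the US/Canada lists are abbreviated); adding a country is just another dictionary entry, with no new code:

```python
import pandas as pd
import numpy as np

states = ["TX", "FL"]          # abbreviated for the example
provinces = ["AB", "QC"]
mx_states = ["JAL", "NLE"]     # hypothetical Mexican state codes

df = pd.DataFrame({"country": [np.nan, np.nan, np.nan],
                   "state": ["fl", "qc", "jal"]})

conditions = {"ca": provinces, "us": states, "mx": mx_states}
for country, values in conditions.items():
    # Fill country only where it is missing and the state matches.
    df.loc[df["country"].isnull()
           & df["state"].str.upper().isin(values), "country"] = country
```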
Upvotes: 2
Reputation: 38415
You can use a nested `np.where`:

df['country'] = np.where(df['state'].str.upper().isin(states), 'us', np.where(df['state'].str.upper().isin(provinces), 'ca', np.nan))

  state country
0    tx      us
1    ab      ca
2    fl      us
3  None     nan
4    qc      ca
Edit: to include the check on `country` first,

cond1 = df.loc[df['country'].isnull(), 'state'].str.upper().isin(states)
cond2 = df.loc[df['country'].isnull(), 'state'].str.upper().isin(provinces)
df.loc[df['country'].isnull(), 'country'] = np.where(cond1, 'us', np.where(cond2, 'ca', np.nan))

  state country
0    tx      us
1    ab      ca
2    fl      us
3   NaN     nan
4    qc      ca
5  dawd     nan
Another way is `np.select`; it is a one-liner and scales well with multiple conditions:

df.loc[df['country'].isnull(), 'country'] = np.select([cond1, cond2], ['us', 'ca'], np.nan)
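A variation on the same idea (a sketch, with the lookup lists abbreviated): build the null check into the conditions themselves and pass the existing column as `np.select`'s `default`, so unmatched rows keep whatever value they already had and no `.loc` mask is needed on the assignment:

```python
import pandas as pd
import numpy as np

states = ["TX", "FL"]       # abbreviated for the example
provinces = ["AB", "QC"]

df = pd.DataFrame({'country': ['us', np.nan, np.nan, np.nan],
                   'state': ['tx', 'fl', 'qc', 'dawd']})

cond_null = df['country'].isnull()
cond_us = cond_null & df['state'].str.upper().isin(states)
cond_ca = cond_null & df['state'].str.upper().isin(provinces)

# Rows matching neither condition fall through to their current value.
df['country'] = np.select([cond_us, cond_ca], ['us', 'ca'],
                          default=df['country'].to_numpy())
```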
Upvotes: 1