Matt

Reputation: 1632

Pandas: apply a value based on the value in another column

I have a dataframe:

   state   country
0  tx      us
1  ab      ca
2  fl      
3          
4  qc      ca
5  dawd    

I'm trying to create a function that checks whether there is a value in the country column. If there is NO value in country, it should check whether the value in state is a Canadian or American abbreviation and, if it is, assign the corresponding country name to the country column for that row.

For instance, in the sample DF above, the function would see that in row 2 country is blank, then see that the state, fl, is part of the US, and assign the country to be us.

I'm thinking this can be done with df.apply(), but I'm having trouble with the execution.

I've been playing around with the code below, but I'm doing something wrong...

def country_identifier(country):
    states = ["AK", "AL", "AZ", "AR", "CA", "CO", "CT", "DE", "FL", "GA", "HI", "ID", "IL", "IN", "IA", "KS", "KY", 
              "LA", "ME", "MD", "MA", "MI", "MN", "MS", "MO", "MT", "NE", "NV", "NH", "NJ", "NM", "NY", "NC", "ND", 
              "OH", "OK", "OR", "PA", "RI", "SC", "SD", "TN", "TX", "UT", "VT", "VA", "WA", "WV", "WI", "WY"]
    provinces = ["ON", "BC", "AB", "MB", "NB", "QC", "NL", "NT", "NS", "PE", "YT", "NU", "SK"]
    if country["country"] not None:
        if country["state"] in states:
            return "us"
        elif country["state"] in provinces:
            return "ca"
    else:
        return country

df2 = df[["country", "state"]].apply(country_identifier)
df2

Upvotes: 1

Views: 95

Answers (2)

roganjosh

Reputation: 13175

You don't need nested np.where conditions; nesting puts a hard limit on how many conditions can be checked. Use df.loc unless your list of conditions expands quite dramatically; it will also be faster than apply.

import pandas as pd
import numpy as np

states = ["AK", "AL", "AZ", "AR", "CA", "CO", "CT", "DE", "FL", "GA", "HI", "ID", "IL", "IN", "IA", "KS", "KY",
          "LA", "ME", "MD", "MA", "MI", "MN", "MS", "MO", "MT", "NE", "NV", "NH", "NJ", "NM", "NY", "NC", "ND",
          "OH", "OK", "OR", "PA", "RI", "SC", "SD", "TN", "TX", "UT", "VT", "VA", "WA", "WV", "WI", "WY"]
provinces = ["ON", "BC", "AB", "MB", "NB", "QC", "NL", "NT", "NS", "PE", "YT", "NU", "SK"]

df = pd.DataFrame({'country': {0: 'us', 1: 'ca', 2: np.nan, 3: np.nan, 4: 'ca', 5: np.nan},
                   'state': {0: 'tx', 1: 'ab', 2: 'fl', 3: np.nan, 4: 'qc', 5: 'dawd'}})

df.loc[(df['country'].isnull()) 
       & (df['state'].str.upper().isin(states)), 'country'] = 'us'

df.loc[(df['country'].isnull()) 
       & (df['state'].str.upper().isin(provinces)), 'country'] = 'ca'

It is also extensible: build a dictionary mapping each country to its list of abbreviations and generalise the replacements in a loop.

conditions = {'ca': provinces, 'us': states}

for country, values in conditions.items():
    df.loc[(df['country'].isnull()) 
           & (df['state'].str.upper().isin(values)), 'country'] = country
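Run against the sample frame (a self-contained check, with the state/province lists trimmed to just the abbreviations that appear in it), the loop fills exactly the rows where country was missing and the state is recognised:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'country': {0: 'us', 1: 'ca', 2: np.nan, 3: np.nan, 4: 'ca', 5: np.nan},
                   'state': {0: 'tx', 1: 'ab', 2: 'fl', 3: np.nan, 4: 'qc', 5: 'dawd'}})

# lists trimmed to the abbreviations that appear in the sample frame
conditions = {'ca': ['AB', 'QC'], 'us': ['TX', 'FL']}

# fill country only where it is missing and the state is recognised
for country, values in conditions.items():
    df.loc[df['country'].isnull()
           & df['state'].str.upper().isin(values), 'country'] = country

print(df['country'].tolist())  # ['us', 'ca', 'us', nan, 'ca', nan]
```

Row 3 (no state) and row 5 ('dawd', not a valid abbreviation) stay NaN, so they remain easy to spot afterwards.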

Upvotes: 2

Vaishali

Reputation: 38415

You can use a nested np.where,

df['country'] = np.where(df['state'].str.upper().isin(states), 'us',
                         np.where(df['state'].str.upper().isin(provinces), 'ca', np.nan))

    state   country
0   tx      us
1   ab      ca
2   fl      us
3   None    nan
4   qc      ca

Edit: to include the check for a missing country first,

cond1 = df.loc[df['country'].isnull(), 'state'].str.upper().isin(states)
cond2 = df.loc[df['country'].isnull(), 'state'].str.upper().isin(provinces)
df.loc[df['country'].isnull(), 'country'] = np.where(cond1, 'us', np.where(cond2, 'ca', np.nan))



    state   country
0   tx      us
1   ab      ca
2   fl      us
3   NaN     nan
4   qc      ca
5   dawd    nan

Another way is np.select; it's a one-liner and scales well with multiple conditions,

df.loc[df['country'].isnull(), 'country'] = np.select([cond1, cond2], ['us', 'ca'], np.nan)
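As a self-contained sketch of the np.select route (sample frame rebuilt, lists trimmed to the sample's abbreviations, and None used as the default so unmatched rows stay genuine nulls):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'country': {0: 'us', 1: 'ca', 2: np.nan, 3: np.nan, 4: 'ca', 5: np.nan},
                   'state': {0: 'tx', 1: 'ab', 2: 'fl', 3: np.nan, 4: 'qc', 5: 'dawd'}})

# lists trimmed to the abbreviations in the sample frame
states, provinces = ['TX', 'FL'], ['AB', 'QC']

missing = df['country'].isnull()
cond1 = df.loc[missing, 'state'].str.upper().isin(states)
cond2 = df.loc[missing, 'state'].str.upper().isin(provinces)

# default=None keeps unmatched rows as real nulls; a np.nan default can be
# promoted to the string 'nan' when the other choices are strings
df.loc[missing, 'country'] = np.select([cond1, cond2], ['us', 'ca'], default=None)

print(df['country'].isna().tolist())  # [False, False, False, True, False, True]
```

The lowercase nan in the printed outputs above suggests NumPy's string promotion turned the NaN default into the literal string 'nan'; using default=None avoids that.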

Upvotes: 1
