Yet another question on pandas partial string merge

Question

I know there have been a number of very similar questions, but I can't make their answers work for me. I want to add a column from another dataframe based on a partial string match: one string is contained in the other, but not necessarily at the beginning. Here is an example:

import pandas as pd

df = pd.DataFrame({'citizenship': ['Algeria', 'Andorra', 'Bahrain', 'Spain']})
df2 = pd.DataFrame({'Country_Name': ['Algeria, Republic of', 'Andorra', 'Kingdom of Bahrain', 'Russia'],
                    'Continent_Name': ['Africa', 'Europe', 'Asia', 'Europe']})

df should get the continent from df2 attached to each 'citizenship' based on the string match / merge. I have been trying to apply the solution from "Pandas: join on partial string match, like Excel VLOOKUP", but cannot get it to work:

def get_continent(x):
    return df2.loc[df2['Country_Name'].str.contains(x), df2['Continent_Name']].iloc[0]

df['Continent_Name'] = df['citizenship'].apply(get_continent)

But it gives me a KeyError:

KeyError: "None of [Index(['Asia', 'Europe', 'Antarctica', 'Africa', 'Oceania', 'Europe', 'Africa',
       'North America', 'Europe', 'Asia',
       ...
       'Asia', 'South America', 'Oceania', 'Oceania', 'Asia', 'Africa',
       'Oceania', 'Asia', 'Asia', 'Asia'],
      dtype='object', length=262)] are in the [columns]"

Does anybody know what is going on here?

user6386471 · Accepted Answer

I can see two issues with the code in your question:

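1. In the function's return line, remove the df2[] wrapper from the second argument to df2.loc, so that the column name is passed as a plain string: df2.loc[df2['Country_Name'].str.contains(x), 'Continent_Name'].iloc[0]. With df2['Continent_Name'] in that position, .loc treats the column's values ('Africa', 'Europe', ...) as column labels to look up, which is what raises the KeyError.
2. The code from the linked answer only works when every "citizenship" in df has a match among the "country names" in df2. When there is no match (as with 'Spain' here), the selection is empty and .iloc[0] raises an IndexError.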
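Both failure modes can be reproduced in isolation. The following is a minimal sketch using the df2 from the question; the variable names (mask, first_match, no_match and the two flags) are only illustrative:

```python
import pandas as pd

df2 = pd.DataFrame({
    'Country_Name': ['Algeria, Republic of', 'Andorra', 'Kingdom of Bahrain', 'Russia'],
    'Continent_Name': ['Africa', 'Europe', 'Asia', 'Europe'],
})

mask = df2['Country_Name'].str.contains('Bahrain')

# Issue 1: a Series as the column indexer makes .loc look up its *values*
# ('Africa', 'Europe', ...) as column labels, none of which exist.
try:
    df2.loc[mask, df2['Continent_Name']]
    key_error_raised = False
except KeyError:
    key_error_raised = True

# Passing the column name as a string selects that single column.
first_match = df2.loc[mask, 'Continent_Name'].iloc[0]

# Issue 2: no row contains 'Spain', so the selection is empty
# and .iloc[0] has nothing to return.
no_match = df2.loc[df2['Country_Name'].str.contains('Spain'), 'Continent_Name']
try:
    no_match.iloc[0]
    index_error_raised = False
except IndexError:
    index_error_raised = True

print(key_error_raised, first_match, index_error_raised)
```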

So this works, for example, when every citizenship has a match:

df = pd.DataFrame({'citizenship': ['Algeria', 'Andorra', 'Bahrain', 'Spain']})    
df2 = pd.DataFrame({'Country_Name': ['Algeria', 'Andorra', 'Bahrain', 'Spain'], 
'Continent_Name': ['Africa', 'Europe', 'Asia', 'Europe']})


def get_continent(x):
    return df2.loc[df2['Country_Name'].str.contains(x), 'Continent_Name'].iloc[0]

df['Continent_Name'] = df['citizenship'].apply(get_continent)

#   citizenship Continent_Name
# 0    Algeria  Africa
# 1    Andorra  Europe
# 2    Bahrain  Asia
# 3    Spain    Europe

If you want to get the original code to work with unmatched names, you could wrap the lookup in a try/except so that those rows get None:

df = pd.DataFrame({'citizenship': ['Algeria', 'Andorra', 'Bahrain', 'Spain']}) 
df2 = pd.DataFrame({'Country_Name': ['Algeria, Republic of', 'Andorra', 'Kingdom of Bahrain', 'Russia'], 
'Continent_Name': ['Africa', 'Europe', 'Asia', 'Europe']})

def get_continent(x):
    try:
        return df2.loc[df2['Country_Name'].str.contains(x), 'Continent_Name'].iloc[0]
    except IndexError:
        return None

df['Continent_Name'] = df['citizenship'].apply(get_continent)


#   citizenship Continent_Name
# 0  Algeria      Africa
# 1  Andorra      Europe
# 2  Bahrain      Asia
# 3  Spain        None
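As a possible variation (not part of the original answer), the same lookup can be written without exception handling by checking whether the selection is empty. The sketch below also passes regex=False to str.contains, on the assumption that the citizenship values should be matched as literal text and not as regular expressions. The lookup parameter and the hits variable are illustrative names. Note that any plain substring test still has the limitation that a short name can match inside a longer one (for example 'Niger' inside 'Nigeria'), and only the first hit is returned.

```python
import pandas as pd

df = pd.DataFrame({'citizenship': ['Algeria', 'Andorra', 'Bahrain', 'Spain']})
df2 = pd.DataFrame({
    'Country_Name': ['Algeria, Republic of', 'Andorra', 'Kingdom of Bahrain', 'Russia'],
    'Continent_Name': ['Africa', 'Europe', 'Asia', 'Europe'],
})

def get_continent(name, lookup=df2):
    # regex=False: treat the name as a literal substring, so characters
    # such as '(' or '.' in a country name are not interpreted as regex.
    hits = lookup.loc[
        lookup['Country_Name'].str.contains(name, regex=False),
        'Continent_Name',
    ]
    # Return the first matching continent, or None when nothing matched.
    return hits.iloc[0] if not hits.empty else None

df['Continent_Name'] = df['citizenship'].apply(get_continent)
print(df)
```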
