Reputation: 272
I know, there have been a number of very close examples, but I can't make them work for me. I want to add a column from another dataframe based on partial string match: The one string is contained in the other, but not necessarily at the beginning. Here is an example:
df = pd.DataFrame({'citizenship': ['Algeria', 'Andorra', 'Bahrain', 'Spain']})
df2 = pd.DataFrame({'Country_Name': ['Algeria, Republic of', 'Andorra', 'Kingdom of Bahrain', 'Russia'],
'Continent_Name': ['Africa', 'Europe', 'Asia', 'Europe']})
df should get the continent from df2 attached to each 'citizenship' based on the string match / merge. I have been trying to apply the solution mentioned here Pandas: join on partial string match, like Excel VLOOKUP, but cannot get it to work
def get_continent(x):
return df2.loc[df2['Country_Name'].str.contains(x), df2['Continent_Name']].iloc[0]
df['Continent_Name'] = df['citizenship'].apply(get_continent)
But it gives me a key error
KeyError: "None of [Index(['Asia', 'Europe', 'Antarctica', 'Africa', 'Oceania', 'Europe', 'Africa',\n 'North America', 'Europe', 'Asia',\n ...\n 'Asia', 'South America', 'Oceania', 'Oceania', 'Asia', 'Africa',\n 'Oceania', 'Asia', 'Asia', 'Asia'],\n dtype='object', length=262)] are in the [columns]"
Anybody knows what is going on here?
Upvotes: 0
Views: 294
Reputation: 386
One way you could do this is create a citizenship
column in df2
and use that to join the dataframes together. I think the easiest way to make this column would be to use regex.
citizenship_list = df['citizenship'].unique().tolist()
citizenship_regex = r"(" + r"|".join(citizenship_list) + r")"
df2["citizenship"] = df2["Country_Name"].str.extract(citizenship_regex).iloc[:, 0]
joined_df = df.merge(df2, on=["citizenship"], how="left")
print(joined_df)
Then you can reduce this to select just the columns you want.
Also, you probably want to clean both the citizenship
and Country_Name
columns by running df['citizenship'] = df['citizenship'].str.lower()
on them so that you don't missing something due to case.
Upvotes: 1
Reputation: 1263
I can see two issues with the code in your question:
df2[]
bit in the second positional argument to df2.loc
, to leave the column name as a string: df2.loc[df2['Country_Name'].str.contains(x), 'Continent_Name'].iloc[0]
So this works for example:
df = pd.DataFrame({'citizenship': ['Algeria', 'Andorra', 'Bahrain', 'Spain']})
df2 = pd.DataFrame({'Country_Name': ['Algeria', 'Andorra', 'Bahrain', 'Spain'],
'Continent_Name': ['Africa', 'Europe', 'Asia', 'Europe']})
def get_continent(x):
return df2.loc[df2['Country_Name'].str.contains(x), 'Continent_Name'].iloc[0]
df['Continent_Name'] = df['citizenship'].apply(get_continent)
# citizenship Continent_Name
# 0 Algeria Africa
# 1 Andorra Europe
# 2 Bahrain Asia
# 3 Spain Europe
If you want to get the original code to work, you could put in a try/except:
df = pd.DataFrame({'citizenship': ['Algeria', 'Andorra', 'Bahrain', 'Spain']})
df2 = pd.DataFrame({'Country_Name': ['Algeria, Republic of', 'Andorra', 'Kingdom of Bahrain', 'Russia'],
'Continent_Name': ['Africa', 'Europe', 'Asia', 'Europe']})
def get_continent(x):
try:
return df2.loc[df2['Country_Name'].str.contains(x), 'Continent_Name'].iloc[0]
except IndexError:
return None
df['Continent_Name'] = df['citizenship'].apply(get_continent)
# citizenship Continent_Name
# 0 Algeria Africa
# 1 Andorra Europe
# 2 Bahrain Asia
# 3 Spain None
Upvotes: 1