Reputation: 749
I am struggling to find the easiest way to do a case-insensitive merge in pandas. Is there a way to do it directly in the merge call? Do I need to use (?i) or a regex with ignorecase? In the snippet below I am joining on country names, which may appear as "United States" in one file and "UNITED STATES" in another, and I just want to take case out of the equation. Thank you!
import pandas as pd
import csv
import sys
env_path = sys.argv[1]
map_path = sys.argv[2]
df_address = pd.read_csv(env_path + "\\address.csv")
df_CountryMapping = pd.read_csv(map_path + "\\CountryMapping.csv")
df_merged = df_address.merge(df_CountryMapping, left_on="Country", right_on="NAME", how="left")
....
Upvotes: 32
Views: 57114
Reputation: 501
Take each column and call str.lower() to return a copy with all lowercase values. Then pass those as the left_on and right_on parameters.
This approach does not modify the original DataFrames.
df_merged = pd.merge(df_address, df_CountryMapping, left_on=df_address["Country"].str.lower(), right_on=df_CountryMapping["NAME"].str.lower(), how="left")
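For illustration, here is a minimal self-contained sketch of the same call. The two small DataFrames and the Street and CODE columns are made up to stand in for address.csv and CountryMapping.csv. In the pandas versions I am aware of, passing Series as keys also adds a helper key column to the result (typically named key_0), so the sketch drops it defensively rather than assuming it is there.

```python
import pandas as pd

# Hypothetical sample data standing in for address.csv and CountryMapping.csv.
df_address = pd.DataFrame(
    {"Street": ["1 Main St", "2 High St"], "Country": ["United States", "canada"]}
)
df_CountryMapping = pd.DataFrame(
    {"NAME": ["UNITED STATES", "CANADA"], "CODE": ["US", "CA"]}
)

# Merge on lowercased copies of the key columns; the inputs are not modified.
df_merged = pd.merge(
    df_address,
    df_CountryMapping,
    left_on=df_address["Country"].str.lower(),
    right_on=df_CountryMapping["NAME"].str.lower(),
    how="left",
)

# Drop the helper key column if pandas added one (assumed name: key_0).
df_merged = df_merged.drop(columns="key_0", errors="ignore")
print(df_merged)
```

The original Country and NAME columns keep their original casing in the result.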
Upvotes: 50
Reputation: 579
Another option is ".str.casefold()", which performs a more aggressive, Unicode-aware case normalization intended for caseless matching (for example, it maps the German "ß" to "ss"). If you're only using English alphabetic characters, it should give the same result as ".str.lower()".
df_address['country_casefolded'] = df_address['Country'].str.casefold()
df_CountryMapping['name_casefolded'] = df_CountryMapping['NAME'].str.casefold()
df_merged = df_address.merge(df_CountryMapping, left_on="country_casefolded", right_on="name_casefolded", how="left")
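To show where the two methods diverge, here is a small sketch using a made-up pair of spellings (not taken from the question's data). With lower() the two values stay distinct, while casefold() normalizes both to the same string, so they would match in a merge.

```python
import pandas as pd

# Hypothetical values: the same word written with "SS" and with "ß".
names = pd.Series(["STRASSE", "Straße"])

lowered = names.str.lower()      # "ß" is already lowercase, so it is kept
folded = names.str.casefold()    # "ß" is folded to "ss"

print(lowered.tolist())
print(folded.tolist())
```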
Upvotes: 1
Reputation: 643
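One solution would be to convert the column names of both data frames to lowercase. Note that this normalizes only the column labels; the values in those columns still need to be lowercased (e.g. with .str.lower()) for the merge itself to be case-insensitive. So something like this: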
df_address = pd.read_csv(env_path + "\\address.csv")
df_CountryMapping = pd.read_csv(map_path + "\\CountryMapping.csv")
df_address.rename(columns=lambda x: x.lower(), inplace=True)
df_CountryMapping.rename(columns=lambda x: x.lower(), inplace=True)
df_merged = df_address.merge(df_CountryMapping, left_on="country", right_on="name", how="left")
Upvotes: 1
Reputation: 13690
I suggest lowercasing the column names after reading the files
df_address.columns=[c.lower() for c in df_address.columns]
df_CountryMapping.columns=[c.lower() for c in df_CountryMapping.columns]
Then lowercase the values in the join columns (note that this overwrites the original casing)
df_address['country']=df_address['country'].str.lower()
df_CountryMapping['name']=df_CountryMapping['name'].str.lower()
Only then do the merge
df_merged = df_address.merge(df_CountryMapping, left_on="country", right_on="name", how="left")
Upvotes: 4
Reputation: 2804
Lowercase the values in the two columns used for the merge, then merge on the lowercased columns.
df_address['country_lower'] = df_address['Country'].str.lower()
df_CountryMapping['name_lower'] = df_CountryMapping['NAME'].str.lower()
df_merged = df_address.merge(df_CountryMapping, left_on="country_lower", right_on="name_lower", how="left")
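If the helper columns are not wanted in the output, one possible variation is sketched below. It assumes made-up sample data (including a hypothetical CODE column), uses assign() so the original frames are left untouched, and drops the helper columns after the merge.

```python
import pandas as pd

# Hypothetical sample data standing in for the two CSV files.
df_address = pd.DataFrame({"Country": ["United States", "France"]})
df_CountryMapping = pd.DataFrame(
    {"NAME": ["UNITED STATES", "FRANCE"], "CODE": ["US", "FR"]}
)

# assign() returns new frames, so df_address and df_CountryMapping are unchanged.
left = df_address.assign(country_lower=df_address["Country"].str.lower())
right = df_CountryMapping.assign(name_lower=df_CountryMapping["NAME"].str.lower())

df_merged = left.merge(
    right, left_on="country_lower", right_on="name_lower", how="left"
)

# Remove the helper key columns once the merge is done.
df_merged = df_merged.drop(columns=["country_lower", "name_lower"])
print(df_merged)
```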
Upvotes: 38