EMC

Reputation: 749

Case insensitive pandas dataframe.merge

I am struggling to find the easiest way to do a case-insensitive merge in pandas. Is there a way to do it directly in the merge call? Do I need to use (?i) or a regex with ignorecase? In the code snippet below I am joining on country names, where a value may be "United States" in one file and "UNITED STATES" in the other, and I just want to take case out of the equation. Thank you!

import pandas as pd
import csv
import sys

env_path = sys.argv[1]
map_path = sys.argv[2]


df_address = pd.read_csv(env_path + "\\address.csv")
df_CountryMapping = pd.read_csv(map_path + "\\CountryMapping.csv")

df_merged = df_address.merge(df_CountryMapping, left_on="Country", right_on="NAME", how="left")

....

Upvotes: 32

Views: 57114

Answers (5)

dattatreya moganti

Reputation: 501

Take each column and call str.lower() to return a copy with all lower case values. Then pass those as the left_on and right_on parameters. This approach does not modify the original DataFrames.

df_merged = pd.merge(df_address, df_CountryMapping, left_on=df_address["Country"].str.lower(), right_on=df_CountryMapping["NAME"].str.lower(), how="left")
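A minimal, self-contained sketch of this approach, using small made-up frames in place of the CSV files (the sample rows and the `Code` column are assumptions for illustration). In the pandas versions I have tried, passing Series as keys adds an extra `key_0` column holding the lowercased key, so the sketch drops it if it is present.

```python
import pandas as pd

# Hypothetical stand-ins for address.csv and CountryMapping.csv
df_address = pd.DataFrame({"City": ["Boston", "Paris"],
                           "Country": ["United States", "France"]})
df_CountryMapping = pd.DataFrame({"NAME": ["UNITED STATES", "FRANCE"],
                                  "Code": ["US", "FR"]})

# Merge on lowercased copies of the key columns; the originals are untouched
df_merged = pd.merge(
    df_address,
    df_CountryMapping,
    left_on=df_address["Country"].str.lower(),
    right_on=df_CountryMapping["NAME"].str.lower(),
    how="left",
)

# Drop the helper key column if pandas added one
df_merged = df_merged.drop(columns="key_0", errors="ignore")
print(df_merged)
```

Both "Country" and "NAME" keep their original casing in the result; only the comparison is done on lowercased values.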

Upvotes: 50

Lelouch

Reputation: 579

Another option is ".str.casefold()", which applies Unicode case folding and so handles non-English characters more thoroughly than lowercasing (for example, the German "ß" is folded to "ss"). If you're only using English alphabetic characters, it should give the same result as ".str.lower()".

df_address['country_casefolded'] = df_address['Country'].str.casefold()
df_CountryMapping['name_casefolded'] = df_CountryMapping['NAME'].str.casefold()
df_merged = df_address.merge(df_CountryMapping, left_on="country_casefolded", right_on="name_casefolded", how="left")
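To see where the two methods differ, here is a small sketch with made-up sample values (the word pair is an assumption chosen only because it contains "ß"):

```python
import pandas as pd

# Hypothetical sample values: the same word spelled with "ß" and with "SS"
names = pd.Series(["Straße", "STRASSE"])

lowered = names.str.lower()        # lower() leaves "ß" as it is
casefolded = names.str.casefold()  # casefold() folds "ß" to "ss"

print(lowered.tolist())
print(casefolded.tolist())
```

After `.str.lower()` the two values are still different strings, so they would not match in a merge; after `.str.casefold()` they are identical.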

Upvotes: 1

mway

Reputation: 643

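One solution would be to convert the column names of both data frames to all lowercase. Note that this normalizes only the column labels; the values themselves would still need to be lowercased for the match to ignore case. So something like this: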

df_address = pd.read_csv(env_path + "\\address.csv")
df_CountryMapping = pd.read_csv(map_path + "\\CountryMapping.csv")

df_address.rename(columns=lambda x: x.lower(), inplace=True)
df_CountryMapping.rename(columns=lambda x: x.lower(), inplace=True)

df_merged = df_address.merge(df_CountryMapping, left_on="country", right_on="name", how="left")
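A quick way to check what renaming the columns does to the match is to try it on a tiny example. The frames below are hypothetical stand-ins for the two CSV files, and the `Code` column is an assumption for illustration:

```python
import pandas as pd

# Hypothetical stand-ins for the two CSV files
df_address = pd.DataFrame({"Country": ["United States"]})
df_CountryMapping = pd.DataFrame({"NAME": ["UNITED STATES"], "Code": ["US"]})

# Lowercase the column labels only; the cell values are not changed
df_address = df_address.rename(columns=str.lower)
df_CountryMapping = df_CountryMapping.rename(columns=str.lower)

df_merged = df_address.merge(df_CountryMapping, left_on="country", right_on="name", how="left")

# True means no row found a partner: the values still differ in case
print(df_merged["code"].isna().all())
```

In this sketch the left join finds no partner for "United States", which suggests the values need to be normalized as well as the labels.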

Upvotes: 1

Uri Goren

Reputation: 13690

I suggest lowercasing the column names after reading the files:

df_address.columns=[c.lower() for c in df_address.columns]
df_CountryMapping.columns=[c.lower() for c in df_CountryMapping.columns]

Then lowercase the values in the key columns:

df_address['country']=df_address['country'].str.lower()
df_CountryMapping['name']=df_CountryMapping['name'].str.lower()

And only then do the merge:

df_merged = df_address.merge(df_CountryMapping, left_on="country", right_on="name", how="left")

Upvotes: 4

Shashank Agarwal

Reputation: 2804

Lowercase the values in the two columns that will be used for the merge, store them in new helper columns, and then merge on those helper columns:

df_address['country_lower'] = df_address['Country'].str.lower()
df_CountryMapping['name_lower'] = df_CountryMapping['NAME'].str.lower()
df_merged = df_address.merge(df_CountryMapping, left_on="country_lower", right_on="name_lower", how="left")
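If the helper columns are not wanted in the result, the same steps can be wrapped in a small function that drops them afterwards. This is only a sketch: `merge_ignore_case`, the temporary column names, and the sample data are all hypothetical and not part of pandas.

```python
import pandas as pd

def merge_ignore_case(left, right, left_on, right_on, how="left"):
    """Hypothetical helper: merge on lowercased copies of two key columns."""
    # Temporary column names, assumed not to clash with real columns
    tmp_left, tmp_right = "_key_left", "_key_right"
    # assign() returns new frames, so the callers' frames are not modified
    merged = left.assign(**{tmp_left: left[left_on].str.lower()}).merge(
        right.assign(**{tmp_right: right[right_on].str.lower()}),
        left_on=tmp_left,
        right_on=tmp_right,
        how=how,
    )
    # Remove the helper columns so only the original columns remain
    return merged.drop(columns=[tmp_left, tmp_right])

# Made-up sample data standing in for the two CSV files
df_address = pd.DataFrame({"Country": ["United States", "Atlantis"]})
df_CountryMapping = pd.DataFrame({"NAME": ["UNITED STATES"], "Code": ["US"]})

df_merged = merge_ignore_case(df_address, df_CountryMapping, "Country", "NAME")
print(df_merged)
```

Rows without a match in the mapping (here "Atlantis") keep missing values in the mapping columns, as with any left join.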

Upvotes: 38
