Reputation: 113

Check if column value is in other columns in pandas ignroing special characters and the size of characters

There is a slight variant to this question Check if column value is in other columns in pandas

I have a dataframe called test

name_0        name_1    overall_name
Asda          Nan       Tesco
Asda          Nan       ASDA
LIDL 1        Asda      Lidl
AAA           Asda      ASDA
AAA           Asda      ASDA
Sainsbury     Nan       Lidl

How do I check whether test.overall_name is in any of the other columns ['name_0', 'name_1' etc] ignoring the size of characters (lowercase/uppercase) and any special characters.

So my ideal dataframe should look like:

name_0        name_1    overall_name   namematch 
Asda          Nan       Tesco          no match 
Asda          Nan       ASDA           match
LIDL 1        Asda      Lidl           match
AAA           Asda      ASDA           match
AAA           Asda      ASDA           match
Sainsbury     Nan       Lidl           no match

Upvotes: 1

Answers (3)

Nev1111

Reputation: 1049

Re-create the sample data frame:

df=pd.DataFrame({'name_0':['Asda','AS-DA','Asda','LIDL1','AAA','AAA','Sainsbury'],
                 'name_1':[np.nan,np.nan,'Asda','As da','Asda','Asda',np.nan],
                 'overall_name':['Tesco','ASDA','Lidl1','ASDA','ASDA','Lid1','As da']})

Convert float to string:

df=df.fillna('nan')

Remove special characters: '-' and ' ', note: need to import 'regex' lib

import re

df = df.applymap(lambda x: re.sub(r'-','', x))
df=df.applymap(lambda x: re.sub(r' ','',x))

Create a list:

name_0=df['name_0'].tolist()
name_1=df['name_1'].tolist()
name_concat=name_0+name_1

Obtain results:

df['namematch']=df['overall_name'].str.lower().isin([x.lower() for x in name_concat])
df['namematch']=np.where(df['namematch']==True,'match','nomatch')

Upvotes: 0

wwnde

Reputation: 26676

IICU

Make df have a common case. Then use boolean indexing combined with np.where to check and attribute

Using boolean indexing combined with np.where check and attribute

Dataframe Used

df["namematch "] = np.where((df.drop("overall_name", 1).apply(lambda x:x.str.lower())).isin(df["overall_name"].str.lower()).any(1),'match','nomatch')

#Make df have a common case. In this situation made it lower case
#df=df.apply(lambda x:x.str.lower())
# df["namematch "] = np.where(df.drop("overall_name", 1).isin(df["overall_name"]).any(1),'match','nomatch')

Outcome

Upvotes: 0

Hamid

Reputation: 612

Check this out:

This method converts and compares the values:

import pandas as pd 
import re

def match (first, second, overall):
    f = re.sub(r"[^a-zA-Z]"," ", first.lower()).strip()
    s = re.sub(r"[^a-zA-Z]"," ", second.lower()).strip()
    o = re.sub(r"[^a-zA-Z]"," ", overal.lower()).strip()
    if f == o:
        return 1
    elif s == o:
        return 1
    else:
        return 0

This line of code adds match column and applys the function to each row:

df['match'] = df.apply(lambda x: match(x['name_0'],x['name_1'],x['overall_name']),axis=1)

the result is something like this:

    name_0  name_1  overall_name    match
  0 Asda    Nan     Tesco             0
  1 Asda    Nan     ASDA              1
  2 LIDL 1  Asda    Lidl              1
  3 AAA     Asda    ASDA              1
  4 AAA     Asda    ASDA              1
  5 Sainsbury   Nan Lidl              0

let me know if it works for you.

Upvotes: 1

Check if column value is in other columns in pandas ignroing special characters and the size of characters

Answers (3)

Related Questions