Groupby and find similar or same items in two columns in Python

Question

For a data frame as follows, if the string in name2 is approximately similar or same as the string in name1 in each group of type, then return Y, otherwise N.

   id type     name1      name2
0   1    A  James B.      James
1   2    A     Keras     Steven
2   3    A       NaN      Keras
3   4    B      Jack       Lucy
4   5    B      Lucy       Jack
5   6    C    Jasica     Hoverd
6   7    C    Steven     Jasica
7   8    C       NaN  Steven L.

The expected result is like this, for example, in type A, James from name2 have a similar value James B. in name1, Keras has same values in both name2 and name1, so both of them return Y in result. While Steven does not exist in name1, so return N.

   id type     name1      name2 result
0   1    A  James B.      James      Y
1   2    A     Keras     Steven      N
2   3    A       NaN      Keras      Y
3   4    B      Jack       Lucy      Y
4   5    B      Lucy       Jack      Y
5   6    C    Jasica     Hoverd      N
6   7    C    Steven     Jasica      Y
7   8    C       NaN  Steven L.      Y

Someone could help to do that? Thank you.

If find similar values is too complicated to realize, then find only same values and return Y will be OK.

jezrael · Accepted Answer

Without similarity it is simplier:

mask = df.groupby('type', group_keys=False).apply(lambda x: x['name2'].isin(x['name1']))
df['new'] = np.where(mask, 'Y','N')
print (df)
   id type     name1      name2 new
0   1    A  James B.      James   N
1   2    A     Keras     Steven   N
2   3    A       NaN      Keras   Y
3   4    B      Jack       Lucy   Y
4   5    B      Lucy       Jack   Y
5   6    C    Jasica     Hoverd   N
6   7    C    Steven     Jasica   Y
7   8    C       NaN  Steven L.   N

With basic similarity with split:

mask = (df.assign(name1 = df['name1'].fillna('|').astype(str).str.split().str[0],
                  name2 = df['name2'].astype(str).str.split().str[0])
          .groupby('type', group_keys=False)
          .apply(lambda x: x['name2'].isin(x['name1'])))
df['new'] = np.where(mask, 'Y','N')
print (df)

   id type     name1      name2 new
0   1    A  James B.      James   Y
1   2    A     Keras     Steven   N
2   3    A       NaN      Keras   Y
3   4    B      Jack       Lucy   Y
4   5    B      Lucy       Jack   Y
5   6    C    Jasica     Hoverd   N
6   7    C    Steven     Jasica   Y
7   8    C       NaN  Steven L.   Y

For better matchin similarity is possible use SequenceMatcher for ratio and filter it by treshold, e.g. here by 0.5:

from difflib import SequenceMatcher

def f(x):
    comp = [any(SequenceMatcher(None, a, b).ratio() > .5 
                                   for a in x['name1'].fillna('_')) 
                                   for b in x['name2']]
    return pd.Series(comp, index=x.index)

mask = df.groupby('type', group_keys=False).apply(f)
df['new'] = np.where(mask, 'Y','N')
print (df)
   id type       name1      name2 new
0   1    A    James B.      James   Y
1   2    A       Keras     Steven   N
2   3    A         NaN      Keras   Y
3   4    B        Jack       Lucy   Y
4   5    B        Lucy       Jack   Y
5   6    C      Jasica     Hoverd   N
6   7    C  Steven LA.     Jasica   Y
7   8    C         NaN  Steven L.   Y

Groupby and find similar or same items in two columns in Python

Answers (2)

Related Questions