Reputation: 10051
For a data frame as follows, if the string in name2
is approximately similar or same as the string in name1
in each group of type
, then return Y
, otherwise N
.
id type name1 name2
0 1 A James B. James
1 2 A Keras Steven
2 3 A NaN Keras
3 4 B Jack Lucy
4 5 B Lucy Jack
5 6 C Jasica Hoverd
6 7 C Steven Jasica
7 8 C NaN Steven L.
The expected result is like this, for example, in type
A
, James
from name2
have a similar value James B.
in name1
, Keras
has same values in both name2
and name1
, so both of them return Y
in result
. While Steven
does not exist in name1
, so return N
.
id type name1 name2 result
0 1 A James B. James Y
1 2 A Keras Steven N
2 3 A NaN Keras Y
3 4 B Jack Lucy Y
4 5 B Lucy Jack Y
5 6 C Jasica Hoverd N
6 7 C Steven Jasica Y
7 8 C NaN Steven L. Y
Someone could help to do that? Thank you.
If find similar values is too complicated to realize, then find only same values and return Y
will be OK.
Upvotes: 1
Views: 392
Reputation: 863501
Without similarity it is simplier:
mask = df.groupby('type', group_keys=False).apply(lambda x: x['name2'].isin(x['name1']))
df['new'] = np.where(mask, 'Y','N')
print (df)
id type name1 name2 new
0 1 A James B. James N
1 2 A Keras Steven N
2 3 A NaN Keras Y
3 4 B Jack Lucy Y
4 5 B Lucy Jack Y
5 6 C Jasica Hoverd N
6 7 C Steven Jasica Y
7 8 C NaN Steven L. N
With basic similarity with split:
mask = (df.assign(name1 = df['name1'].fillna('|').astype(str).str.split().str[0],
name2 = df['name2'].astype(str).str.split().str[0])
.groupby('type', group_keys=False)
.apply(lambda x: x['name2'].isin(x['name1'])))
df['new'] = np.where(mask, 'Y','N')
print (df)
id type name1 name2 new
0 1 A James B. James Y
1 2 A Keras Steven N
2 3 A NaN Keras Y
3 4 B Jack Lucy Y
4 5 B Lucy Jack Y
5 6 C Jasica Hoverd N
6 7 C Steven Jasica Y
7 8 C NaN Steven L. Y
For better matchin similarity is possible use SequenceMatcher
for ratio and filter it by treshold, e.g. here by 0.5
:
from difflib import SequenceMatcher
def f(x):
comp = [any(SequenceMatcher(None, a, b).ratio() > .5
for a in x['name1'].fillna('_'))
for b in x['name2']]
return pd.Series(comp, index=x.index)
mask = df.groupby('type', group_keys=False).apply(f)
df['new'] = np.where(mask, 'Y','N')
print (df)
id type name1 name2 new
0 1 A James B. James Y
1 2 A Keras Steven N
2 3 A NaN Keras Y
3 4 B Jack Lucy Y
4 5 B Lucy Jack Y
5 6 C Jasica Hoverd N
6 7 C Steven LA. Jasica Y
7 8 C NaN Steven L. Y
Upvotes: 2
Reputation: 3770
df['result'] = pd.DataFrame(df.groupby('type').apply(lambda x: ['Y' if i in ' '.join(x['name1'].astype(str)) else 'N' for i in list(x['name2'].dropna().str.split().str[0])]).tolist()).stack().reset_index(drop=True)
Output
id type name1 name2 result
0 1 A James B. James Y
1 2 A Keras Steven N
2 3 A NaN Keras Y
3 4 B Jack Lucy Y
4 5 B Lucy Jack Y
5 6 C Jasica Hoverd N
6 7 C Steven Jasica Y
7 8 C NaN Steven L. Y
Upvotes: 2