Reputation: 4100
Sample DF:
ID Match1 Match2 Match3 Match4 Match5
1 Yes No Yes Yes Yes
2 Yes No Yes Yes No
2 Yes No No Yes Yes
3 No Yes Yes Yes No
3 No Yes No No No
4 Yes No Yes No No
4 Yes No Yes Yes Yes
Expected DF:
ID Match1 Match2 Match3 Match4 Match5 Final_Match
1 Yes No Yes Yes Yes Clear
2 Yes No Yes Yes No Unclear
2 Yes No No Yes Yes Unclear
3 No Yes Yes Yes No Clear
3 No Yes No No No Unclear
4 Yes No Yes No No Unclear
4 Yes No Yes Yes Yes Clear
Problem Statement:
Clear
in Final_Match
column (Example ID 1)If the ID's are repetitive then within an ID count Yes
in Match1 to Match5 columns, whichever has greater "Yes" put Clear
for that one and Unclear
for the other (Example ID 3 & 4
If the ID's are repetitive then within an ID count Yes
in Match1 to Match5 columns,if they have equal "Yes" put Unclear
in both (Example ID 2)
I couldn't find anything on how to solve within ID ?
Upvotes: 3
Views: 463
Reputation: 18647
You could also achieve this by using Groupby.rank
:
# Helper Series
s = (df.replace({'Yes': 1, 'No': 0})
.iloc[:, 1:]
.sum(1))
df['final_match'] = np.where(s.groupby(df['ID']).rank(ascending=False).eq(1), 'Clear', 'Unclear')
Upvotes: 2
Reputation: 29742
Using pandas.DataFrame.groupby
:
final_match = []
for i, d in df.groupby('ID'):
if len(d) == 1:
final_match.append('Clear')
else:
counter = (d.filter(like='Match') == 'Yes').sum(1)
if counter.nunique() == 1:
final_match.extend(['Unclear'] * len(d))
else:
final_match.extend(counter.apply(lambda x: 'Clear' if x == max(counter) else 'Unclear').tolist())
df['final_match'] = final_match
print(df)
ID Match1 Match2 Match3 Match4 Match5 final_match
0 1 Yes No Yes Yes Yes Clear
1 2 Yes No Yes Yes No Unclear
2 2 Yes No No Yes Yes Unclear
3 3 No Yes Yes Yes No Clear
4 3 No Yes No No No Unclear
5 4 Yes No Yes No No Unclear
6 4 Yes No Yes Yes Yes Clear
Explanation:
len(d) == 1
: if non-repetitive, add Clear
counter = (d.filter(like='Match') == 'Yes').sum(1)
: Counts the number of 'Yes' in each columncounter.nunique() == 1
: if all rows have the same number of 'Yes's, then all are marked as 'Unclear'counter.apply(lambda x: 'Clear' if x == max(counter) else 'Unclear').tolist()
: if rows have different counts of 'Yes', mark the highest with 'Clear', rest with 'Unclear'Upvotes: 1
Reputation: 75080
Another way of doing this would be:
df['sum_yes']=df.iloc[:,1:6].eq('Yes').sum(axis=1)
df['final']=df.groupby('ID')['sum_yes'].transform\
(lambda x: np.where((x==x.max())&(~x.duplicated(keep=False)),'Clear','Unclear'))
print(df)
ID Match1 Match2 Match3 Match4 Match5 sum_yes final
0 1 Yes No Yes Yes Yes 4 Clear
1 2 Yes No Yes Yes No 3 Unclear
2 2 Yes No No Yes Yes 3 Unclear
3 3 No Yes Yes Yes No 3 Clear
4 3 No Yes No No No 1 Unclear
5 4 Yes No Yes No No 2 Unclear
6 4 Yes No Yes Yes Yes 4 Clear
P.S You can drop the sum_yes
column if you want.
Upvotes: 2