Rahul Agarwal
Rahul Agarwal

Reputation: 4100

Match values of multiple columns within an ID

Sample DF:

ID     Match1        Match2        Match3     Match4       Match5
1      Yes           No            Yes        Yes          Yes
2      Yes           No            Yes        Yes          No
2      Yes           No            No         Yes          Yes
3      No            Yes           Yes        Yes          No
3      No            Yes           No         No           No
4      Yes           No            Yes        No           No
4      Yes           No            Yes        Yes          Yes

Expected DF:

 ID     Match1     Match2        Match3     Match4    Match5 Final_Match
    1      Yes      No            Yes        Yes      Yes     Clear
    2      Yes      No            Yes        Yes      No      Unclear
    2      Yes      No            No         Yes      Yes     Unclear
    3      No       Yes           Yes        Yes      No      Clear
    3      No       Yes           No         No       No      Unclear
    4      Yes      No            Yes        No       No      Unclear
    4      Yes      No            Yes        Yes      Yes     Clear

Problem Statement:

  1. If ID is non-repetitive simply put Clear in Final_Match column (Example ID 1)
  2. If the ID's are repetitive then within an ID count Yes in Match1 to Match5 columns, whichever has greater "Yes" put Clear for that one and Unclear for the other (Example ID 3 & 4

  3. If the ID's are repetitive then within an ID count Yes in Match1 to Match5 columns,if they have equal "Yes" put Unclear in both (Example ID 2)

I couldn't find anything on how to solve within ID ?

Upvotes: 3

Views: 463

Answers (3)

Chris Adams
Chris Adams

Reputation: 18647

You could also achieve this by using Groupby.rank:

# Helper Series
s = (df.replace({'Yes': 1, 'No': 0})
     .iloc[:, 1:]
     .sum(1))

df['final_match'] = np.where(s.groupby(df['ID']).rank(ascending=False).eq(1), 'Clear', 'Unclear')

Upvotes: 2

Chris
Chris

Reputation: 29742

Using pandas.DataFrame.groupby:

final_match = []
for i, d in df.groupby('ID'):
    if len(d) == 1:
        final_match.append('Clear')
    else:
        counter = (d.filter(like='Match') == 'Yes').sum(1)
        if counter.nunique() == 1:
            final_match.extend(['Unclear'] * len(d))
        else:
            final_match.extend(counter.apply(lambda x: 'Clear' if x == max(counter) else 'Unclear').tolist())
df['final_match'] = final_match

print(df)
   ID Match1 Match2 Match3 Match4 Match5 final_match
0   1    Yes     No    Yes    Yes    Yes       Clear
1   2    Yes     No    Yes    Yes     No     Unclear
2   2    Yes     No     No    Yes    Yes     Unclear
3   3     No    Yes    Yes    Yes     No       Clear
4   3     No    Yes     No     No     No     Unclear
5   4    Yes     No    Yes     No     No     Unclear
6   4    Yes     No    Yes    Yes    Yes       Clear

Explanation:

  • len(d) == 1: if non-repetitive, add Clear
  • counter = (d.filter(like='Match') == 'Yes').sum(1): Counts the number of 'Yes' in each column
  • counter.nunique() == 1: if all rows have the same number of 'Yes's, then all are marked as 'Unclear'
  • counter.apply(lambda x: 'Clear' if x == max(counter) else 'Unclear').tolist(): if rows have different counts of 'Yes', mark the highest with 'Clear', rest with 'Unclear'

Upvotes: 1

anky
anky

Reputation: 75080

Another way of doing this would be:

df['sum_yes']=df.iloc[:,1:6].eq('Yes').sum(axis=1)
df['final']=df.groupby('ID')['sum_yes'].transform\
             (lambda x: np.where((x==x.max())&(~x.duplicated(keep=False)),'Clear','Unclear'))
print(df)

   ID Match1 Match2 Match3 Match4 Match5  sum_yes    final
0   1    Yes     No    Yes    Yes    Yes        4    Clear
1   2    Yes     No    Yes    Yes     No        3  Unclear
2   2    Yes     No     No    Yes    Yes        3  Unclear
3   3     No    Yes    Yes    Yes     No        3    Clear
4   3     No    Yes     No     No     No        1  Unclear
5   4    Yes     No    Yes     No     No        2  Unclear
6   4    Yes     No    Yes    Yes    Yes        4    Clear

P.S You can drop the sum_yes column if you want.

Upvotes: 2

Related Questions