XaHUaR
XaHUaR

Reputation: 23

Validate rows from a Pandas dataframe are equal between values in columns

Say I have a pandas DataFrame as below:

   C30    C25    C20    C15    C10
1  AJA    EJE    IJI    OJO    UJU
2  AJA    EJE    IJI    OJO    UJU
3  AJA    EJE    IJI    OJO    UJU
4  ABA    EBE    IBI    OBO    UBU
5  ABA    EBE    IBI    OBO    UBU
6  ABA    EBE    IBI    OBO    UNU
7  BOB    BIB    BEB    BAB    BUB

I would like to validate all the rows that have equal values in C30 have also equal values in C25, C20, C15 & C10.

I have approximately 60,000 rows with a lot of variation in the column values, if elements in column C30 are equal then the corresponding column C25 elements should also be equal and so on.

For the example, validation should shows a mismatch because C10 value in row 6 is not equal to 4 & 5 rows. What's the most efficient way to do this?

Upvotes: 2

Views: 449

Answers (2)

Vishnudev Krishnadas
Vishnudev Krishnadas

Reputation: 10960

Use DataFrame.duplicated to find rows that are not unique.

df['is_dup'] = df.duplicated(keep=False)

Output

   C30  C25  C20  C15  C10  is_dup
1  AJA  EJE  IJI  OJO  UJU    True
2  AJA  EJE  IJI  OJO  UJU    True
3  AJA  EJE  IJI  OJO  UJU    True
4  ABA  EBE  IBI  OBO  UBU    True
5  ABA  EBE  IBI  OBO  UBU    True
6  ABA  EBE  IBI  OBO  UNU   False
7  BOB  BIB  BEB  BAB  BUB   False

Upvotes: 1

jezrael
jezrael

Reputation: 862511

You can use DataFrameGroupBy.nunique and compare for not equal 1 by DataFrame.ne:

m = df.groupby('C30').nunique().ne(1)
print (m)
       C25    C20    C15    C10
C30                            
ABA  False  False  False   True
AJA  False  False  False  False
BOB  False  False  False  False

And then if need values if no matched use np.where:

i, c  = np.where(m)

L = list(zip(m.index[i], m.columns[c]))

if len(L) > 0:
    print ('mismatch')
    print (L)
mismatch
[('ABA', 'C10')]
    

Upvotes: 1

Related Questions