Reputation: 23
Say I have a pandas DataFrame as below:
C30 C25 C20 C15 C10
1 AJA EJE IJI OJO UJU
2 AJA EJE IJI OJO UJU
3 AJA EJE IJI OJO UJU
4 ABA EBE IBI OBO UBU
5 ABA EBE IBI OBO UBU
6 ABA EBE IBI OBO UNU
7 BOB BIB BEB BAB BUB
I would like to validate that all rows with equal values in C30 also have equal values in C25, C20, C15 & C10.
I have approximately 60,000 rows with a lot of variation in the column values; if elements in column C30 are equal, then the corresponding elements in column C25 should also be equal, and so on.
For the example, validation should report a mismatch because the C10 value in row 6 is not equal to the one in rows 4 & 5. What's the most efficient way to do this?
Upvotes: 2
Views: 449
Reputation: 10960
Use DataFrame.duplicated to flag rows that are not unique.
df['is_dup'] = df.duplicated(keep=False)
Output
C30 C25 C20 C15 C10 is_dup
1 AJA EJE IJI OJO UJU True
2 AJA EJE IJI OJO UJU True
3 AJA EJE IJI OJO UJU True
4 ABA EBE IBI OBO UBU True
5 ABA EBE IBI OBO UBU True
6 ABA EBE IBI OBO UNU False
7 BOB BIB BEB BAB BUB False
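For reference, a minimal runnable sketch of this answer, assuming the DataFrame is built from the question's sample data:

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'C30': ['AJA', 'AJA', 'AJA', 'ABA', 'ABA', 'ABA', 'BOB'],
    'C25': ['EJE', 'EJE', 'EJE', 'EBE', 'EBE', 'EBE', 'BIB'],
    'C20': ['IJI', 'IJI', 'IJI', 'IBI', 'IBI', 'IBI', 'BEB'],
    'C15': ['OJO', 'OJO', 'OJO', 'OBO', 'OBO', 'OBO', 'BAB'],
    'C10': ['UJU', 'UJU', 'UJU', 'UBU', 'UBU', 'UNU', 'BUB'],
}, index=range(1, 8))

# keep=False marks every member of a duplicated set as True, so rows
# that differ in any column (row 6) and fully unique rows (row 7)
# both come out False
df['is_dup'] = df.duplicated(keep=False)
print(df)
```

Note that row 7 is flagged False simply because it is unique, even though its C30 group is internally consistent, so this flag marks full-row duplicates rather than directly answering the per-group consistency question.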
Upvotes: 1
Reputation: 862511
You can use DataFrameGroupBy.nunique
and compare for not equal to 1 with DataFrame.ne:
m = df.groupby('C30').nunique().ne(1)
print (m)
C25 C20 C15 C10
C30
ABA False False False True
AJA False False False False
BOB False False False False
And then, if you need the mismatched values, use np.where:
import numpy as np

i, c = np.where(m)
L = list(zip(m.index[i], m.columns[c]))
if len(L) > 0:
    print('mismatch')
    print(L)
mismatch
[('ABA', 'C10')]
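If a per-row flag is preferred over the grouped summary, the same check can also be expressed with GroupBy.transform, which broadcasts the per-group unique count back onto every row. A sketch, again assuming the question's sample data:

```python
import pandas as pd

df = pd.DataFrame({
    'C30': ['AJA', 'AJA', 'AJA', 'ABA', 'ABA', 'ABA', 'BOB'],
    'C25': ['EJE', 'EJE', 'EJE', 'EBE', 'EBE', 'EBE', 'BIB'],
    'C20': ['IJI', 'IJI', 'IJI', 'IBI', 'IBI', 'IBI', 'BEB'],
    'C15': ['OJO', 'OJO', 'OJO', 'OBO', 'OBO', 'OBO', 'BAB'],
    'C10': ['UJU', 'UJU', 'UJU', 'UBU', 'UBU', 'UNU', 'BUB'],
}, index=range(1, 8))

# Within each C30 group, a column is consistent when it holds exactly one
# distinct value; transform('nunique') returns that count on every row of
# the group, so any count > 1 flags all rows of the mismatched group
mismatch = df.groupby('C30').transform('nunique').gt(1).any(axis=1)
print(df[mismatch])
```

Here all three ABA rows (4, 5 and 6) are flagged, because the mismatch in C10 makes the whole group inconsistent.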
Upvotes: 1