buhtz

Reputation: 12202

Compare or diff two pandas columns element wise

I am new to pandas (but not to data science and Python). This question is not only about how to solve this specific problem but also about how to handle problems like this the pandas way.

Please feel free to improve the title of this question, because I am not sure what the correct terms are.

Here is my MWE

#!/usr/bin/env python3

import pandas as pd

data = {'A': [1, 2, 3, 3, 1, 4],
        'B': ['One', 'Two', 'Three', 'Three', 'Eins', 'Four']}

df = pd.DataFrame(data)

print(df)

Resulting in

   A      B
0  1    One
1  2    Two
2  3  Three
3  3  Three
4  1   Eins
5  4   Four

My assumption is that when the value in column A is 1, the value in column B is always One, and so on.

I want to prove that assumption.

Secondly, I also assume that if my first assumption is incorrect, this is not an error but has valid (human) reasons; e.g. see row index 4, where the A value 1 is related to Eins (and not One) in column B.

Because of that I also need to see and explore the cases where my assumption is incorrect.

Update of the question: This data is only an example. In the real world I do not know the pairing of the two columns in advance. Because of that, solutions like this do not work in my case:

df.loc[df['A'] == 1, 'B']

I do not know how many and which expressions are in column A.

I do not know how to do that with pandas. How would a pandas professional solve this?

My approach would be to use pure Python code with list(), set() and some iterations. ;)

Upvotes: 1

Views: 631

Answers (2)

Rutger

Reputation: 603

You can filter your data frame this way:

df.loc[df['A'] == 1, 'B']

This gives you the values of B where A is 1. Next you can add an equals statement:

df.loc[df['A'] == 1, 'B'] == 'One'

Which results in a boolean series (True, False in this case). If you want to check if all are true, you add:

all(df.loc[df['A'] == 1, 'B'] == 'One')

And the answer is False because of the Eins.
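When the values in A are not known in advance, the same check could be repeated for every distinct value of A. The following is only a sketch under that assumption; the names `results` and `b_values` are made up for illustration, and each group is compared against its own first B value rather than a hard-coded string.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 3, 1, 4],
                   'B': ['One', 'Two', 'Three', 'Three', 'Eins', 'Four']})

# For every distinct value in A, check whether all related B values
# are equal to the first B value found for that A.
results = {}
for a in df['A'].unique().tolist():
    b_values = df.loc[df['A'] == a, 'B']
    results[a] = bool((b_values == b_values.iloc[0]).all())

print(results)
```

The loop filters the frame once per distinct A value, so it is mainly useful for a quick look at small data.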

EDIT

If you want to create a new column which says whether your criterion is met (always the same value in B for a given value in A), you can do this:

df['C'] = df['A'].map(df.groupby('A')['B'].nunique() < 2)

This results in a boolean column. Column C is created by mapping each value in A through the Series given in the parentheses. That Series comes from grouping by A and counting the unique values of B in each group; where the count is below 2, B is unique for that A and the result is True.
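To make the individual steps visible, the one-liner can be split up as in the sketch below. The intermediate names `unique_counts` and `is_consistent` are only illustrative and not part of the original line.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 3, 1, 4],
                   'B': ['One', 'Two', 'Three', 'Three', 'Eins', 'Four']})

# Number of distinct B values for each A value (a Series indexed by A).
unique_counts = df.groupby('A')['B'].nunique()

# True where an A value is paired with exactly one B value.
is_consistent = unique_counts < 2

# Look up every row's A value in that boolean Series.
df['C'] = df['A'].map(is_consistent)

print(df)
```

Rows 0 and 4 get False in C because A == 1 is paired with both One and Eins.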

Upvotes: 2

jezrael

Reputation: 863431

If the solution should test whether there is only one unique value of B per value of A and return all rows that fail this test, use DataFrameGroupBy.nunique to count the unique values, inside GroupBy.transform, which repeats the aggregated value for every row of its group. Then it is possible to filter the rows where the count is not 1, which means there are 2 or more unique values of B for that A:

df1 = df[df.groupby('A').B.transform('nunique').ne(1)]
print (df1)
   A     B
0  1   One
4  1  Eins

if df1.empty:
    print ('My assumption is good')
else:
    print ('My assumption is wrong')
    print (df1)
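For exploring the failing cases, it may also help to see the conflicting B values side by side for each A. This is a possible sketch, not part of the solution above; `values_per_a` and `conflicts` are names chosen only for this example.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 3, 1, 4],
                   'B': ['One', 'Two', 'Three', 'Three', 'Eins', 'Four']})

# Distinct B values for every A value, in order of first appearance.
values_per_a = df.groupby('A')['B'].unique()

# Keep only the A values that are paired with more than one B value.
conflicts = values_per_a[values_per_a.map(len) > 1]

for a, b_values in conflicts.items():
    print(a, list(b_values))
```

If `conflicts` is empty, the assumption holds for the whole data frame.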

Upvotes: 0
