vagabond
vagabond

Reputation: 3594

Conditional testing and comparison of two columns in a pandas dataframe

I am trying to perform a conditional check on a dataframe with two columns as follows: content of either column cannot be in the other column unless both values are equal - there can be no instance where a value is present in both columns and the values are not equal.

In the e.g. below, the condition is met. Though alternahaircare.com is in both columns - domain and credit_domain, the only case where that happens, both values are equal.

Out[198]: df
                            domain               credit_domain
2              alternahaircare.com         alternahaircare.com
3    alternahaircare.myshopify.com         alternahaircare.com
4         shop.alternahaircare.com         alternahaircare.com
5                 americancrew.com            americancrew.com

Below are two modified examples of the above where the conditions I want to satisfy is not met:

In this one, you can see that there is a row where domain = alternahaircare and credit_domain = americancrew.com. This is wrong - now alternahaircare.com occurs in both columns but there is an instance where the value in the credit_domain column does not match domain column.

Out[199]: df
                            domain               credit_domain
2              alternahaircare.com            americancrew.com
3    alternahaircare.myshopify.com         alternahaircare.com
4         shop.alternahaircare.com         alternahaircare.com
5                 americancrew.com            americancrew.com

The below instance is also wrong - does not meet the conditions i want to satisfy. Though there is an instance of alternahaircare.com == alternahaircare.com, there is also an instance of alternahaircare.com being assigned to a value that is not itself i.e. americacrew.com

Out[198]: df
                            domain               credit_domain
2              alternahaircare.com         alternahaircare.com
3    alternahaircare.myshopify.com         alternahaircare.com
4         shop.alternahaircare.com         alternahaircare.com
5              alternahaircare.com            americancrew.com

The second of of these problem can be addressed easily by checking the count of domain values and identifying anything that occurs twice.

The first problem has me confused. My approach was to identify the potential problem rows and then try using a groupby and a conditional statement:

potential_error_rows = df[df.credit_domain.isin([x for x in df.credit_domain.tolist() if x in df.domain.tolist()])]

Couldn't think of what to do beyond this!

Upvotes: 2

Views: 1506

Answers (2)

vagabond
vagabond

Reputation: 3594

I was quite close already and found the answer:

# This gives the suspect rows - rows of data where domain value is same as the list of credit_domain present in domain
x = df[df.domain.isin([x for x in df.credit_domain.tolist() if x in df.domain.tolist()])]

# Once I have the above, it was a simple matter of looking at those rows where domain != credit_domain
x[x.domain != x.credit_domain]

Upvotes: 0

miradulo
miradulo

Reputation: 29690

If I understand the mapping correctly that you want between columns, you can simply check if there are any rows in domain which correspond to both

  • being in credit_domain
  • not being equal to their row value in credit_domain.

Like so

def is_valid(df):
    return ~(df.domain.isin(df.credit_domain) &
             (df.domain != df.credit_domain)).any()

Which returns True, False, and False for your example DataFrames.

Upvotes: 1

Related Questions