Reputation: 3594
I am trying to perform a conditional check on a dataframe with two columns as follows: content of either column cannot be in the other column unless both values are equal - there can be no instance where a value is present in both columns and the values are not equal.
In the e.g. below, the condition is met. Though alternahaircare.com is in both columns - domain and credit_domain, the only case where that happens, both values are equal.
Out[198]: df
domain credit_domain
2 alternahaircare.com alternahaircare.com
3 alternahaircare.myshopify.com alternahaircare.com
4 shop.alternahaircare.com alternahaircare.com
5 americancrew.com americancrew.com
Below are two modified examples of the above where the conditions I want to satisfy is not met:
In this one, you can see that there is a row where domain = alternahaircare and credit_domain = americancrew.com. This is wrong - now alternahaircare.com occurs in both columns but there is an instance where the value in the credit_domain column does not match domain column.
Out[199]: df
domain credit_domain
2 alternahaircare.com americancrew.com
3 alternahaircare.myshopify.com alternahaircare.com
4 shop.alternahaircare.com alternahaircare.com
5 americancrew.com americancrew.com
The below instance is also wrong - does not meet the conditions i want to satisfy. Though there is an instance of alternahaircare.com == alternahaircare.com, there is also an instance of alternahaircare.com being assigned to a value that is not itself i.e. americacrew.com
Out[198]: df
domain credit_domain
2 alternahaircare.com alternahaircare.com
3 alternahaircare.myshopify.com alternahaircare.com
4 shop.alternahaircare.com alternahaircare.com
5 alternahaircare.com americancrew.com
The second of of these problem can be addressed easily by checking the count of domain values and identifying anything that occurs twice.
The first problem has me confused. My approach was to identify the potential problem rows and then try using a groupby and a conditional statement:
potential_error_rows = df[df.credit_domain.isin([x for x in df.credit_domain.tolist() if x in df.domain.tolist()])]
Couldn't think of what to do beyond this!
Upvotes: 2
Views: 1506
Reputation: 3594
I was quite close already and found the answer:
# This gives the suspect rows - rows of data where domain value is same as the list of credit_domain present in domain
x = df[df.domain.isin([x for x in df.credit_domain.tolist() if x in df.domain.tolist()])]
# Once I have the above, it was a simple matter of looking at those rows where domain != credit_domain
x[x.domain != x.credit_domain]
Upvotes: 0
Reputation: 29690
If I understand the mapping correctly that you want between columns, you can simply check if there are any rows in domain
which correspond to both
credit_domain
credit_domain
. Like so
def is_valid(df):
return ~(df.domain.isin(df.credit_domain) &
(df.domain != df.credit_domain)).any()
Which returns True
, False
, and False
for your example DataFrames.
Upvotes: 1