Reputation: 61
I have a merged dataset which I now need to filter.
The merged dataset looks like this:
Flower Id city Color Flower_y City_y Color_y
Jasmine 1023LD Hawai White Jasmine Hawai White
Jasmine 1023LD Hawai White Jasmine Hawai Yellow
Jasmine 1023LD Hawai White Jasmine Hawai Orange
Lily 2457MH Washington Purple Lily Washington Yellow
Lily 2457MH Washington Purple Lily Washington Orange
Lily 2457MH Washington Purple Lily Washington Red
I need to filter this and keep only the flowers where Color and Color_y never match. If there is at least one row where the colors match, none of that flower's rows should be returned. In the example above, none of the Jasmine rows should be returned, because one row has matching Color and Color_y. The Lily row has to be returned. The result DataFrame should look like this:
Flower Id city Color
Lily 2457MH Washington Purple
How do I achieve this?
Thank you!
Upvotes: 0
Views: 61
Reputation: 14949
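You can use groupby with filter to drop every group that has at least one row where Color equals Color_y. Then keep only the required columns (those not ending in _y) and drop the duplicate rows: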
df = df.groupby('Flower').filter(lambda x : not (any(x['Color'] == x['Color_y'])))
use_cols = [col for col in df.columns if not col.endswith('_y')]
df = df[use_cols].drop_duplicates()
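To try this end to end, here is a minimal sketch that rebuilds the sample table from the question and applies the same steps. The literal DataFrame and the names kept and result are only for illustration; the real data would come from your merge.

```python
import pandas as pd

# Sample data rebuilt by hand from the merged table in the question.
df = pd.DataFrame({
    "Flower": ["Jasmine"] * 3 + ["Lily"] * 3,
    "Id": ["1023LD"] * 3 + ["2457MH"] * 3,
    "city": ["Hawai"] * 3 + ["Washington"] * 3,
    "Color": ["White"] * 3 + ["Purple"] * 3,
    "Flower_y": ["Jasmine"] * 3 + ["Lily"] * 3,
    "City_y": ["Hawai"] * 3 + ["Washington"] * 3,
    "Color_y": ["White", "Yellow", "Orange", "Yellow", "Orange", "Red"],
})

# Keep only the groups in which no row has Color equal to Color_y.
kept = df.groupby("Flower").filter(
    lambda g: not (g["Color"] == g["Color_y"]).any()
)

# Drop the *_y columns, then collapse the now-identical rows.
use_cols = [col for col in kept.columns if not col.endswith("_y")]
result = kept[use_cols].drop_duplicates().reset_index(drop=True)
print(result)
```

With the sample data, only the single Lily row should remain, since Jasmine has one White/White match.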
Upvotes: 3
Reputation: 35676
If you need the resulting DataFrame exactly as in your question, with only the first row from each group that should be kept, use apply and return None if there is any match in the group, otherwise return head(1):
df = df.groupby('Flower').apply(
    lambda g: None if g['Color'].eq(g['Color_y']).any() else g.head(1)
).reset_index(drop=True).filter(regex='.*(?<!_y)$')
df:
Flower Id city Color
0 Lily 2457MH Washington Purple
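As a variation (not what the code above does), the same per-group check can be written as a boolean mask with transform instead of apply. This is only a sketch: the sample frame is rebuilt by hand, the names matched and result are made up for illustration, and it uses drop_duplicates rather than head(1), which gives the same output here only because the kept columns are identical within each flower.

```python
import pandas as pd

# Sample data rebuilt by hand from the merged table in the question.
df = pd.DataFrame({
    "Flower": ["Jasmine"] * 3 + ["Lily"] * 3,
    "Id": ["1023LD"] * 3 + ["2457MH"] * 3,
    "city": ["Hawai"] * 3 + ["Washington"] * 3,
    "Color": ["White"] * 3 + ["Purple"] * 3,
    "Flower_y": ["Jasmine"] * 3 + ["Lily"] * 3,
    "City_y": ["Hawai"] * 3 + ["Washington"] * 3,
    "Color_y": ["White", "Yellow", "Orange", "Yellow", "Orange", "Red"],
})

# True on every row of a flower that has at least one Color == Color_y match.
matched = df["Color"].eq(df["Color_y"]).groupby(df["Flower"]).transform("any")

# Keep the rows of unmatched flowers, drop the *_y columns, collapse duplicates.
result = (
    df.loc[~matched]
    .filter(regex=r".*(?<!_y)$")
    .drop_duplicates()
    .reset_index(drop=True)
)
print(result)
```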
Edit: Explanation about Boolean Indexing
For a given frame, a series of Booleans can be generated in a variety of ways:
Given the first group as g:
Flower Id city Color Flower_y City_y Color_y
0 Jasmine 1023LD Hawai White Jasmine Hawai White
1 Jasmine 1023LD Hawai White Jasmine Hawai Yellow
2 Jasmine 1023LD Hawai White Jasmine Hawai Orange
print(g['Color'].eq(g['Color_y']))
0 True
1 False
2 False
dtype: bool
Index 0 meets the condition, so it is True. The other two do not, so they are False. any() reduces this Series to a single value, which is True if any element in the Series is True.
print(g['Color'].eq(g['Color_y']).any())
True
Conditions can be combined in a variety of ways including & and |.
Checking if the colors match OR Color_y endswith 'e', for example:
print((g['Color'].eq(g['Color_y']) | (g['Color_y'].str.endswith('e'))))
0 True
1 False
2 True
dtype: bool
And like before, any() or all() can be used on the Boolean series to resolve it to a single Boolean value.
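A small sketch of how any() and all() differ on the Jasmine group, assuming only the two color columns are needed. The frame is rebuilt by hand and the variable names are illustrative.

```python
import pandas as pd

# The Jasmine group from the question, reduced to the two color columns.
g = pd.DataFrame({
    "Color": ["White", "White", "White"],
    "Color_y": ["White", "Yellow", "Orange"],
})

matches = g["Color"].eq(g["Color_y"])  # True, False, False

any_match = bool(matches.any())        # at least one row matches
all_match = bool(matches.all())        # every row matches
none_match = bool((~matches).all())    # no row matches, same as not any_match

print(any_match, all_match, none_match)
```

For the question's filter, none_match (or equivalently not any_match) is the condition a group has to satisfy to be kept.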
Upvotes: 1