Mmenon

Reputation: 61

How to filter records from a merged dataframe

I have a merged dataset that I now need to filter.

The merged dataset looks like this:

Flower   Id      city        Color   Flower_y  City_y      Color_y
Jasmine  1023LD  Hawai       White   Jasmine   Hawai       White
Jasmine  1023LD  Hawai       White   Jasmine   Hawai       Yellow
Jasmine  1023LD  Hawai       White   Jasmine   Hawai       Orange
Lily     2457MH  Washington  Purple  Lily      Washington  Yellow
Lily     2457MH  Washington  Purple  Lily      Washington  Orange
Lily     2457MH  Washington  Purple  Lily      Washington  Red

I need to filter this and return only the flowers for which Color and Color_y never match. If at least one row in a flower's group has matching colors, none of that flower's rows should be returned. In the example above, no Jasmine rows should be returned, because one row has matching Color and Color_y. The Lily row has to be returned, since none of its rows match. The resulting dataframe should look like this:

Flower   Id      city        Color
Lily     2457MH  Washington  Purple

How do I achieve this?

Thank you!

Upvotes: 0

Views: 61

Answers (2)

Nk03

Reputation: 14949

You can use groupby().filter() to drop every group that contains a matching row, then select the required columns.

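# Keep only the groups in which no row has Color equal to Color_y.
df = df.groupby('Flower').filter(lambda x : not (any(x['Color'] == x['Color_y'])))
# Drop the columns that came from the right-hand side of the merge.
use_cols = [col for col in df.columns if not col.endswith('_y')]
# The remaining rows of a group are identical, so collapse them to one.
df = df[use_cols].drop_duplicates()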
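To try this out end to end, here is a minimal sketch that rebuilds the sample frame from the question and applies the same filter. The names kept and result are only illustrative, and the reset_index call is an optional extra so the output index starts at 0.

```python
import pandas as pd

# Rebuild the merged frame shown in the question.
df = pd.DataFrame({
    "Flower": ["Jasmine"] * 3 + ["Lily"] * 3,
    "Id": ["1023LD"] * 3 + ["2457MH"] * 3,
    "city": ["Hawai"] * 3 + ["Washington"] * 3,
    "Color": ["White"] * 3 + ["Purple"] * 3,
    "Flower_y": ["Jasmine"] * 3 + ["Lily"] * 3,
    "City_y": ["Hawai"] * 3 + ["Washington"] * 3,
    "Color_y": ["White", "Yellow", "Orange", "Yellow", "Orange", "Red"],
})

# Keep only groups in which no row has Color equal to Color_y.
kept = df.groupby("Flower").filter(
    lambda g: not (g["Color"] == g["Color_y"]).any()
)

# Drop the right-hand merge columns and collapse the repeated rows.
use_cols = [col for col in kept.columns if not col.endswith("_y")]
result = kept[use_cols].drop_duplicates().reset_index(drop=True)
print(result)
```

Only the Lily row should remain, since every Jasmine row is dropped as soon as one of them has matching colors.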

Upvotes: 3

Henry Ecker

Reputation: 35676

If you need the resulting DataFrame exactly as in your question, with only the first row from each group that should be kept, use groupby().apply(): return None if any row in the group has matching colors, otherwise return head(1):

df = df.groupby('Flower').apply(
    lambda g: None if g['Color'].eq(g['Color_y']).any() else g.head(1)
).reset_index(drop=True).filter(regex='.*(?<!_y)$')

df:

  Flower      Id        city   Color
0   Lily  2457MH  Washington  Purple
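As a possible alternative (not part of the approach above), the same result can be sketched without apply by broadcasting a per-group flag with transform("any") and using it as a Boolean mask. The names has_match and result are illustrative, and the sample frame is rebuilt from the question.

```python
import pandas as pd

# Rebuild the merged frame shown in the question.
df = pd.DataFrame({
    "Flower": ["Jasmine"] * 3 + ["Lily"] * 3,
    "Id": ["1023LD"] * 3 + ["2457MH"] * 3,
    "city": ["Hawai"] * 3 + ["Washington"] * 3,
    "Color": ["White"] * 3 + ["Purple"] * 3,
    "Flower_y": ["Jasmine"] * 3 + ["Lily"] * 3,
    "City_y": ["Hawai"] * 3 + ["Washington"] * 3,
    "Color_y": ["White", "Yellow", "Orange", "Yellow", "Orange", "Red"],
})

# For every row: does ANY row of the same Flower have Color == Color_y?
has_match = df["Color"].eq(df["Color_y"]).groupby(df["Flower"]).transform("any")

result = (
    df.loc[~has_match]                  # keep rows of groups with no match
      .drop_duplicates(subset="Flower") # first row per remaining group
      .filter(regex=r".*(?<!_y)$")      # drop columns ending in _y
      .reset_index(drop=True)
)
print(result)
```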

Edit: Explanation about Boolean Indexing

For a given frame, a series of Booleans can be generated in a variety of ways:

Given the first group as g:

    Flower      Id   city  Color Flower_y City_y Color_y
0  Jasmine  1023LD  Hawai  White  Jasmine  Hawai   White
1  Jasmine  1023LD  Hawai  White  Jasmine  Hawai  Yellow
2  Jasmine  1023LD  Hawai  White  Jasmine  Hawai  Orange

print(g['Color'].eq(g['Color_y']))
0     True
1    False
2    False
dtype: bool

Index 0 meets the condition, so it is True. The other two do not, so they are False. any() then reduces the series to a single Boolean, which is True if any value in the series is True.

print(g['Color'].eq(g['Color_y']).any())
True

Conditions can be combined in a variety of ways, including & (and) and | (or).

Checking if the colors match OR Color_y ends with 'e', for example:

print((g['Color'].eq(g['Color_y']) | (g['Color_y'].str.endswith('e'))))
0     True
1    False
2     True
dtype: bool

And like before, any() or all() can be used on the Boolean series to resolve it to a single Boolean value.
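To make the difference between any() and all() concrete, here is a small sketch on the Jasmine group alone. Only the two color columns are rebuilt, and the names same and combined are illustrative.

```python
import pandas as pd

# The two color columns of the Jasmine group from the question.
g = pd.DataFrame({
    "Color": ["White", "White", "White"],
    "Color_y": ["White", "Yellow", "Orange"],
})

# Element-wise comparison gives one Boolean per row.
same = g["Color"].eq(g["Color_y"])

# Combine two conditions with | (element-wise OR).
combined = same | g["Color_y"].str.endswith("e")

print(same.any())      # True: at least one row matches
print(same.all())      # False: not every row matches
print(combined.any())  # True
print(combined.all())  # False: 'Yellow' neither matches nor ends with 'e'
```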

Upvotes: 1
