Emil Jansson

Reputation: 159

Filter pandas frame based on example row with "wildcard" values

I have a dataframe and a filter I want to apply to the frame in the form of a series. The filtered dataframe should include all rows that match the filter. Where the filter has a "wildcard-value", defined in this case as NaN, everything is considered a match.

Below is my implementation of such a filter:

import math
import pandas

df: pandas.DataFrame  # the frame to filter
f: pandas.Series      # the example row; NaN marks a wildcard

def match(row: pandas.Series, f: pandas.Series):
    return all([isinstance(value, float) and math.isnan(value) or value == row[idx] 
                for idx, value in zip(f.index, f)])

filtered_df = df[[match(row, f) for _, row in df.iterrows()]]

It does the job, but it's not as elegant as I would like and might be too slow for a large df. I have heard that iterating over pandas frames is frowned upon, so I am looking for a better solution.

How can one write this code in a better way?

Update with runnable code:

import math
import pandas

if __name__ == '__main__':
    data = {'Name': ['Ankit', 'Amit', 'Aishwarya', 'Priyanka'],
            'Age': [21, 19, 19, 19],
            'Stream': ['Math', 'Commerce', 'Arts', 'Biology'],
            'Percentage': [88, 88, 88, 70]}

    df = pandas.DataFrame(data, columns=['Name', 'Age', 'Stream', 'Percentage'])

    f = pandas.Series([math.nan, 19, math.nan, 88], index=['Name', 'Age', 'Stream', 'Percentage'])


    def match(row: pandas.Series, f: pandas.Series):
        return all([isinstance(value, float) and math.isnan(value) or value == row[idx]
                    for idx, value in zip(f.index, f)])


    filtered_df = df[[match(row, f) for _, row in df.iterrows()]]

    print(filtered_df)

Upvotes: 0

Views: 150

Answers (1)

Alessandro

Reputation: 381

You could use an inner join on the non-wildcard columns to keep only the matching rows, for example:

# Drop the wildcard (NaN) entries so only the real conditions remain
f = f.dropna()

# Turn the series into a one-row DataFrame (.T transposes the single column into a row)
f_frame = f.to_frame().T

# Perform inner join
filtered_df = df.merge(f_frame, how='inner', on=list(f_frame.columns))
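
Note that a merge on columns returns a frame with a fresh 0..n-1 index. If you need to keep the original row labels, a boolean mask may be an alternative. Below is a minimal sketch using the sample data from the question; the helper name filter_by_example is made up for illustration and is not part of pandas.

```python
import math
import pandas


def filter_by_example(df: pandas.DataFrame, f: pandas.Series) -> pandas.DataFrame:
    """Hypothetical helper: keep rows of df matching f, treating NaN in f as a wildcard."""
    # Keep only the real conditions; NaN entries are wildcards.
    conditions = f.dropna()
    if conditions.empty:
        # Nothing to filter on, so every row matches.
        return df
    # Compare each constrained column with its example value
    # (the series index is aligned with the column labels),
    # then require a match in every one of those columns.
    mask = df[conditions.index].eq(conditions, axis=1).all(axis=1)
    return df[mask]


# Sample data copied from the question.
data = {'Name': ['Ankit', 'Amit', 'Aishwarya', 'Priyanka'],
        'Age': [21, 19, 19, 19],
        'Stream': ['Math', 'Commerce', 'Arts', 'Biology'],
        'Percentage': [88, 88, 88, 70]}
df = pandas.DataFrame(data, columns=['Name', 'Age', 'Stream', 'Percentage'])
f = pandas.Series([math.nan, 19, math.nan, 88],
                  index=['Name', 'Age', 'Stream', 'Percentage'])

result = filter_by_example(df, f)
print(result)
```

With the sample data this should keep the rows for Amit and Aishwarya under their original index labels 1 and 2. The comparison is done column-wise in one pass, so there is no Python-level loop over the rows.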

Upvotes: 2
