Pirate Prentice

Reputation: 123

How to filter by a condition on a given level in a pandas MultiIndex

Suppose df is the following multi-indexed dataframe:

letter      date        number
x           2021-01-01  0             0
                        1             1
            2021-01-02  0             2
                        1             3
            2021-01-04  0             4
                        1             5
y           2021-02-07  0             6
                        1             7
            2021-02-09  0             8
                        1             9

For each letter, I wish to select only those rows where the date is less than 2 days after the first.

For x, the first date is 2021-01-01, so all the rows where the date is 2021-01-03 or later should be removed.

For y, the first date is 2021-02-07, so all the rows where the date is 2021-02-09 or later should be removed.

The desired outcome would be:

letter      date        number
x           2021-01-01  0             0
                        1             1
            2021-01-02  0             2
                        1             3
y           2021-02-07  0             6
                        1             7

What's the best way to do this?

Upvotes: 2

Views: 54

Answers (1)

Henry Ecker

Reputation: 35636

That actually looks like a Series, so you can build a boolean mask with groupby transform and keep only the dates that are less than 2 days after the first value in each group:

df = df[
    df.reset_index('date')
        .groupby('letter')['date']
        .transform(lambda x: x < (x.iat[0] + pd.Timedelta(days=2))).values
]

letter  date        number
x       2021-01-01  0         0
                    1         1
        2021-01-02  0         2
                    1         3
y       2021-02-07  0         6
                    1         7
dtype: int64

For each group, transform the date column into boolean values by comparing each date against the group's first date (x.iat[0]) offset by 2 days:

(
    df.reset_index('date')
        .groupby('letter')['date']
        .transform(lambda x: x < (x.iat[0] + pd.Timedelta(days=2)))
)

The resulting boolean filter:

0     True
1     True
2     True
3     True
4    False
5    False
6     True
7     True
8    False
9    False
Name: date, dtype: bool
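
As a rough illustration of why the answer passes `.values` when indexing: after `reset_index('date')` the mask is labelled by the remaining levels only, so it no longer shares an index with the original Series and is best applied by position. The sketch below uses a smaller, made-up Series `s` (an assumption for brevity, not the question's data) and `to_numpy()`, which plays the same role as `.values` here.

```python
import pandas as pd

# Hypothetical, reduced version of the question's data
s = pd.DataFrame({
    'letter': ['x', 'x', 'x', 'y', 'y'],
    'date': pd.to_datetime(['2021-01-01', '2021-01-02', '2021-01-04',
                            '2021-02-07', '2021-02-09']),
    'number': [0, 0, 0, 0, 0],
    'value': [0, 2, 4, 6, 8],
}).set_index(['letter', 'date', 'number'])['value']

# Boolean mask: True where the date is less than 2 days after the group's first date
mask = (
    s.reset_index('date')
    .groupby('letter')['date']
    .transform(lambda x: x < x.iat[0] + pd.Timedelta(days=2))
)

# The mask is indexed by (letter, number) only, so its labels differ from s.index
print(list(mask.index.names))  # ['letter', 'number']
print(list(s.index.names))     # ['letter', 'date', 'number']

# Dropping the labels applies the mask row by row, in order
kept = s[mask.to_numpy()]
print(kept.tolist())  # [0, 2, 6]
```

This relies on the mask rows being in the same order as the rows of `s`, which holds here because `transform` returns one value per input row in the original order.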

You can also use min to measure from the earliest date in each group, regardless of row order:

df = df[
    df.reset_index('date')
        .groupby('letter')['date']
        .transform(lambda x: x < (x.min() + pd.Timedelta(days=2))).values
]
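
To make the difference between `x.iat[0]` and `x.min()` concrete, here is a small sketch on a deliberately unsorted, made-up Series (the data and the names `by_first` and `by_min` are assumptions for illustration only). With the latest date listed first, `iat[0]` anchors the window on that row, while `min` still anchors it on the earliest date.

```python
import pandas as pd

# Hypothetical single-group Series whose rows are NOT in date order
s = pd.Series(
    [4, 0, 2],
    index=pd.MultiIndex.from_tuples(
        [('x', pd.Timestamp('2021-01-04'), 0),
         ('x', pd.Timestamp('2021-01-01'), 0),
         ('x', pd.Timestamp('2021-01-02'), 0)],
        names=['letter', 'date', 'number'],
    ),
)

dates = s.reset_index('date').groupby('letter')['date']
two_days = pd.Timedelta(days=2)

# Anchored on the first row of each group (2021-01-04 here)
by_first = dates.transform(lambda x: x < x.iat[0] + two_days).to_numpy()
# Anchored on the earliest date of each group (2021-01-01 here)
by_min = dates.transform(lambda x: x < x.min() + two_days).to_numpy()

print(s[by_first].tolist())  # [4, 0, 2]
print(s[by_min].tolist())    # [0, 2]
```

If the data is already sorted by date within each letter, as in the question, both versions should give the same result.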

Series/df creation code:

import pandas as pd

df = pd.DataFrame({
    'letter': ['x', 'x', 'x', 'x', 'x', 'x', 'y', 'y', 'y', 'y'],
    'date': pd.to_datetime(['2021-01-01', '2021-01-01', '2021-01-02',
                            '2021-01-02', '2021-01-04', '2021-01-04',
                            '2021-02-07', '2021-02-07', '2021-02-09',
                            '2021-02-09']),
    'number': [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
    'value': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
}).set_index(['letter', 'date', 'number']).squeeze().rename(None)

Upvotes: 1
