Reputation: 123
Suppose df is the following multi-indexed dataframe:
letter  date        number
x       2021-01-01  0         0
                    1         1
        2021-01-02  0         2
                    1         3
        2021-01-04  0         4
                    1         5
y       2021-02-07  0         6
                    1         7
        2021-02-09  0         8
                    1         9
For each letter, I wish to select only those rows where the date is no more than 2 days after the first.
For x, the first date is 2021-01-01, so all the rows where the date is after 2021-01-03 should be removed.
For y, the first date is 2021-02-07, so all the rows where the date is after 2021-02-09 should be removed.
The desired outcome would be:
letter  date        number
x       2021-01-01  0         0
                    1         1
        2021-01-02  0         2
                    1         3
y       2021-02-07  0         6
                    1         7
What's the best way to do this?
Upvotes: 2
Views: 54
Reputation: 35636
That actually looks like a Series, so you can create a boolean index with groupby transform and keep only the dates that are less than 2 days after the first value in each group:
df = df[
    df.reset_index('date')
    .groupby('letter')['date']
    .transform(lambda x: x < (x.iat[0] + pd.Timedelta(days=2))).values
]
letter  date        number
x       2021-01-01  0         0
                    1         1
        2021-01-02  0         2
                    1         3
y       2021-02-07  0         6
                    1         7
dtype: int64
For each group, transform the date column into boolean values by comparing each date to the group's first date offset by 2 days.
(
    df.reset_index('date')
    .groupby('letter')['date']
    .transform(lambda x: x < (x.iat[0] + pd.Timedelta(days=2)))
)
Filter:
0 True
1 True
2 True
3 True
4 False
5 False
6 True
7 True
8 False
9 False
Name: date, dtype: bool
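One thing to watch with this comparison: a strict < drops a date that falls exactly 2 days after the first one, which matches the desired output in the question (2021-02-09 is removed for y). If the boundary date should be kept instead, <= would do that. A minimal sketch of the difference, using a small made-up Series called dates that is not part of the original data:

```python
import pandas as pd

# Hypothetical dates for a single group; the first entry plays the role of the "first date".
dates = pd.Series(pd.to_datetime(['2021-02-07', '2021-02-08',
                                  '2021-02-09', '2021-02-10']))

# Cutoff is the first date plus 2 days, i.e. 2021-02-09.
cutoff = dates.iat[0] + pd.Timedelta(days=2)

strict = (dates < cutoff).tolist()      # boundary date excluded
inclusive = (dates <= cutoff).tolist()  # boundary date included

print(strict)     # [True, True, False, False]
print(inclusive)  # [True, True, True, False]
```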
You can also use min to measure the distance from the earliest date in each group, regardless of row order:
df = df[
    df.reset_index('date')
    .groupby('letter')['date']
    .transform(lambda x: x < (x.min() + pd.Timedelta(days=2))).values
]
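If resetting the index feels roundabout, the same filter can probably be built straight from the date index level. This is only a sketch under the assumption that the data is a Series laid out like the one in the question; the names s, dates, cutoff and result are made up for illustration:

```python
import pandas as pd

# Rebuild a Series shaped like the one in the question (letter, date, number index).
index = pd.MultiIndex.from_arrays(
    [
        ['x'] * 6 + ['y'] * 4,
        pd.to_datetime(['2021-01-01', '2021-01-01', '2021-01-02',
                        '2021-01-02', '2021-01-04', '2021-01-04',
                        '2021-02-07', '2021-02-07', '2021-02-09',
                        '2021-02-09']),
        [0, 1] * 5,
    ],
    names=['letter', 'date', 'number'],
)
s = pd.Series(range(10), index=index)

# Pull the date level out as a Series that shares the original index.
dates = pd.Series(s.index.get_level_values('date'), index=s.index)

# Per-letter cutoff: earliest date in the group plus 2 days.
cutoff = dates.groupby(level='letter').transform('min') + pd.Timedelta(days=2)

# Keep rows whose date is strictly before the cutoff.
result = s[(dates < cutoff).to_numpy()]
print(result.tolist())  # [0, 1, 2, 3, 6, 7]
```

Using the built-in 'min' aggregation by name avoids the Python-level lambda, which may matter on larger data.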
Series/df creation code:
import pandas as pd

df = pd.DataFrame({
    'letter': ['x', 'x', 'x', 'x', 'x', 'x', 'y', 'y', 'y', 'y'],
    'date': pd.to_datetime(['2021-01-01', '2021-01-01', '2021-01-02',
                            '2021-01-02', '2021-01-04', '2021-01-04',
                            '2021-02-07', '2021-02-07', '2021-02-09',
                            '2021-02-09']),
    'number': [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
    'value': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
}).set_index(['letter', 'date', 'number']).squeeze().rename(None)
Upvotes: 1