Reputation: 133
I have a pandas dataframe, df, containing ID and date columns:
import datetime
import pandas as pd
from dateutil.relativedelta import relativedelta

start = datetime.datetime.today()
dates = [start, start+relativedelta(days=20), start+relativedelta(days=40),
         start, start+relativedelta(days=35), start+relativedelta(days=36),
         start, start+relativedelta(days=10), start+relativedelta(days=15)]
df = pd.DataFrame({'ID':[1,1,1,2,2,2,3,3,3], 'date':dates})
ID date
0 1 2018-11-29 15:35:56.876549
1 1 2018-12-19 15:35:56.876549
2 1 2019-01-08 15:35:56.876549
3 2 2018-11-29 15:35:56.876549
4 2 2019-01-03 15:35:56.876549
5 2 2019-01-04 15:35:56.876549
6 3 2018-11-29 15:35:56.876549
7 3 2018-12-09 15:35:56.876549
8 3 2018-12-14 15:35:56.876549
Now I want to filter df so that, for every ID, only the first 30 days are included, i.e. date <= (date.min() + 30 days).
For example, for ID=1, 2019-01-08 is more than 30 days after the first date, 2018-11-29, so it should be removed, and likewise for the other IDs. The resulting new dataframe should be:
ID date
0 1 2018-11-29 15:35:56.876549
1 1 2018-12-19 15:35:56.876549
3 2 2018-11-29 15:35:56.876549
6 3 2018-11-29 15:35:56.876549
7 3 2018-12-09 15:35:56.876549
8 3 2018-12-14 15:35:56.876549
How can this be done programmatically?
Upvotes: 1
Views: 293
Reputation: 107687
Consider adding helper columns for the start and end dates, then filtering with boolean indexing. Specifically, use groupby().transform to compute each ID's min date inline, aligned to the original rows:
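# Earliest date per ID, broadcast back to every row of that ID
df['start_date'] = df.groupby('ID')['date'].transform('min')
# pd.Timedelta adds a fixed 30-day offset to a datetime column
df['end_date'] = df['start_date'] + pd.Timedelta(days=30)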
# BOOLEAN MASK
sub_df = df[(df['date'] >= df['start_date']) & (df['date'] <= df['end_date'])]
print(sub_df)
# ID date start_date end_date
# 0 1 2018-11-29 15:22:35.301788 2018-11-29 15:22:35.301788 2018-12-29 15:22:35.301788
# 1 1 2018-12-19 15:22:35.301788 2018-11-29 15:22:35.301788 2018-12-29 15:22:35.301788
# 3 2 2018-11-29 15:22:35.301788 2018-11-29 15:22:35.301788 2018-12-29 15:22:35.301788
# 6 3 2018-11-29 15:22:35.301788 2018-11-29 15:22:35.301788 2018-12-29 15:22:35.301788
# 7 3 2018-12-09 15:22:35.301788 2018-11-29 15:22:35.301788 2018-12-29 15:22:35.301788
# 8 3 2018-12-14 15:22:35.301788 2018-11-29 15:22:35.301788 2018-12-29 15:22:35.301788
# WITH BETWEEN()
sub_df = df[df['date'].between(df['start_date'], df['end_date'])]
print(sub_df)
# ID date start_date end_date
# 0 1 2018-11-29 15:22:35.301788 2018-11-29 15:22:35.301788 2018-12-29 15:22:35.301788
# 1 1 2018-12-19 15:22:35.301788 2018-11-29 15:22:35.301788 2018-12-29 15:22:35.301788
# 3 2 2018-11-29 15:22:35.301788 2018-11-29 15:22:35.301788 2018-12-29 15:22:35.301788
# 6 3 2018-11-29 15:22:35.301788 2018-11-29 15:22:35.301788 2018-12-29 15:22:35.301788
# 7 3 2018-12-09 15:22:35.301788 2018-11-29 15:22:35.301788 2018-12-29 15:22:35.301788
# 8 3 2018-12-14 15:22:35.301788 2018-11-29 15:22:35.301788 2018-12-29 15:22:35.301788
# WITH QUERY()
sub_df = df.query('date >= start_date & date <= end_date')
print(sub_df)
# ID date start_date end_date
# 0 1 2018-11-29 15:22:35.301788 2018-11-29 15:22:35.301788 2018-12-29 15:22:35.301788
# 1 1 2018-12-19 15:22:35.301788 2018-11-29 15:22:35.301788 2018-12-29 15:22:35.301788
# 3 2 2018-11-29 15:22:35.301788 2018-11-29 15:22:35.301788 2018-12-29 15:22:35.301788
# 6 3 2018-11-29 15:22:35.301788 2018-11-29 15:22:35.301788 2018-12-29 15:22:35.301788
# 7 3 2018-12-09 15:22:35.301788 2018-11-29 15:22:35.301788 2018-12-29 15:22:35.301788
# 8 3 2018-12-14 15:22:35.301788 2018-11-29 15:22:35.301788 2018-12-29 15:22:35.301788
To clean up the helper columns afterwards:
# DROP HELPER COLUMNS
sub_df = sub_df.drop(columns=['start_date', 'end_date'])
print(sub_df)
# ID date
# 0 1 2018-11-29 15:22:35.301788
# 1 1 2018-12-19 15:22:35.301788
# 3 2 2018-11-29 15:22:35.301788
# 6 3 2018-11-29 15:22:35.301788
# 7 3 2018-12-09 15:22:35.301788
# 8 3 2018-12-14 15:22:35.301788
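If the helper columns are not needed for anything else, the same filter can be sketched without adding them to df at all. The example below is a minimal variation on the approach above, not a different method: it keeps the per-ID cutoff in a standalone Series (the names cutoff and first_30_days are illustrative) and uses fixed dates instead of datetime.today() so the result is reproducible. The lower bound check is dropped because no date can be earlier than its own group's minimum.

```python
import pandas as pd

# Fixed start date (an assumption for reproducibility; the question uses datetime.today()).
start = pd.Timestamp("2018-11-29")
offsets = [0, 20, 40, 0, 35, 36, 0, 10, 15]
df = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "date": [start + pd.Timedelta(days=d) for d in offsets],
})

# Per-row cutoff: each ID's earliest date plus 30 days, aligned to df's index.
cutoff = df.groupby("ID")["date"].transform("min") + pd.Timedelta(days=30)

# Keep rows on or before the cutoff; df itself gains no extra columns.
first_30_days = df[df["date"] <= cutoff]
print(first_30_days.index.tolist())  # [0, 1, 3, 6, 7, 8]
```

The surviving row labels match the expected output in the question, and since nothing was added to df there is nothing to drop afterwards.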
Upvotes: 2