Reputation: 2314
I would like to do the following. For a data frame that looks like this:
df = pd.DataFrame({"ID": ["A", "A", "C", "B", "B"], "date": ["06/24/2014", "06/25/2014", "06/23/2014", "07/02/1999", "07/02/1999"], "value": ["3", "5", "1", "7", "8"]})
I would like to group the observations by date, placing all rows that fall within 2 days of each other into the same group. In this example, the first three rows (the June 2014 dates) would form one group and the last two rows another.
So far, I have thought of using something similar to:
df.groupby(df['date'].map(lambda x: x.month))
What is the general way to do this type of "fuzzy groupby"?
Thank you.
Upvotes: 2
Views: 3831
Reputation: 879113
You could sort the rows by date, then take the difference between consecutive dates. Test whether the difference is greater than 2 days; taking the cumulative sum of that boolean series assigns the desired group numbers:
import pandas as pd

df = pd.DataFrame({"ID": ["A", "A", "C", "B", "B"],
                   "date": ["06/24/2014", "06/25/2014", "06/23/2014",
                            "07/02/1999", "07/02/1999"],
                   "value": ["3", "5", "1", "7", "8"]})
df['date'] = pd.to_datetime(df['date'])
df = df.sort_values(by='date')
# A new group starts wherever the gap to the previous date exceeds 2 days;
# the cumulative sum of those True values numbers the groups.
df['group'] = (df['date'].diff() > pd.Timedelta(days=2)).cumsum()
print(df)
yields
ID date value group
3 B 1999-07-02 7 0
4 B 1999-07-02 8 0
2 C 2014-06-23 1 1
0 A 2014-06-24 3 1
1 A 2014-06-25 5 1
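The resulting labels can be passed straight to groupby. A minimal sketch (converting value to numeric first, since the sample stores it as strings) that sums value per date cluster:

```python
import pandas as pd

df = pd.DataFrame({"ID": ["A", "A", "C", "B", "B"],
                   "date": ["06/24/2014", "06/25/2014", "06/23/2014",
                            "07/02/1999", "07/02/1999"],
                   "value": ["3", "5", "1", "7", "8"]})
df['date'] = pd.to_datetime(df['date'])
df['value'] = pd.to_numeric(df['value'])  # sample stores values as strings
df = df.sort_values(by='date')

# Same labeling as above: new group when the gap to the previous date > 2 days
group = (df['date'].diff() > pd.Timedelta(days=2)).cumsum()

# Aggregate each cluster of nearby dates, e.g. sum the values
result = df.groupby(group)['value'].sum()
print(result)
```

Note that groupby accepts the label Series directly; there is no need to store it as a column unless you want to keep it.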
Upvotes: 8