Reputation: 49
I have a DataFrame that has students and the days they have attended their classes
Email Day
[email protected] 1
[email protected] 1
[email protected] 1
[email protected] 2
[email protected] 3
[email protected] 3
Expected Output:
Email Streak
[email protected] 1
[email protected] 1
[email protected] 3
The result must be in such a way that only those who attended the classes in a streak like day1,day2,day3 must be printed
How can I do this using pandas?
Upvotes: 4
Views: 81
Reputation: 59579
Here's one way that returns the length of the longest consecutive streak within each 'Email'
.
First drop_duplicates
that way repeated days for the same e-mail don't ruin any streaks, and sort. Then create labels for groups of consecutive days taking the cumsum of where the difference in days is not equal to 1
. Finally group by the 'Email'
and this group label and find the max size.
For clarity, I added an additional group at the end which has a streak of three on days 5,6,7.
print(df)
Email Day
[email protected] 1
[email protected] 1
[email protected] 1
[email protected] 2
[email protected] 3
[email protected] 3
[email protected] 1
[email protected] 5
[email protected] 6
[email protected] 7
df1 = df.drop_duplicates(['Email', 'Day']).sort_values(['Email', 'Day'])
s1 = df1.groupby('Email').Day.diff().ne(1).cumsum()
df1.groupby(['Email', s1]).size().groupby('Email').max()
Email
[email protected] 3
[email protected] 1
[email protected] 1
[email protected] 3
dtype: int64
Upvotes: 4
Reputation: 23099
If I'm understanding your requirement correctly we can use a groupby.diff
and .loc
to filter your values.
df.loc[df.groupby('Email')['Day'].diff().fillna(1).eq(1)].groupby('Email')['Day'].idxmax()
df1 = df.loc[idx]
Email Day
0 [email protected] 1
2 [email protected] 1
4 [email protected] 3
stepping through each part of the code :
df.groupby('Email')['Day'].diff().fillna(1)
0 1.0
1 1.0
2 1.0
3 1.0
4 1.0
5 2.0 # < -- we don't want this.
Name: Day, dtype: float64
Of the above values, we want the max day grouped by email returned as an index value.
print(idx)
Email
[email protected] 0
[email protected] 2
[email protected] 4
Name: Day, dtype: int64
Email Day
0 [email protected] 1 # this row.
1 [email protected] 1
2 [email protected] 1 # this row.
3 [email protected] 2
4 [email protected] 3
5 [email protected] 3 # this row.
if you want to filter out anyone who has no streaks, as in their greatest streak is one day, you could use
df.duplicated(subset=['Email'],keep=False)
Upvotes: 2