How to find the streak using email ids in Pandas Python

Question

I have a DataFrame of students and the days on which they attended their classes:

Email             Day
mala@gmail.com     1
vika@gmail.com     1
rupa@gmail.com     1
vika@gmail.com     2
vika@gmail.com     3
rupa@gmail.com     3

Expected Output:

Email                 Streak
mala@gmail.com          1
rupa@gmail.com          1
vika@gmail.com          3

The result should report, for each student, the streak of classes attended on consecutive days, such as day 1, day 2, day 3.

How can I do this using pandas?

ALollz · Accepted Answer

Here's one way that returns the length of the longest consecutive streak within each 'Email'.

First, drop_duplicates so that repeated days for the same e-mail don't break any streaks, then sort by e-mail and day. Next, create labels for groups of consecutive days by taking the cumsum of where the difference in days is not equal to 1. Finally, group by 'Email' and this group label and take the max of the group sizes.
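
To illustrate the labeling step on its own, here is a minimal sketch for a single student's days. The values in `days` are made up for illustration and are assumed to be already de-duplicated and sorted:

```python
import pandas as pd

# Hypothetical days for one student, already de-duplicated and sorted.
days = pd.Series([1, 5, 6, 7])

# diff() is NaN for the first row and 4, 1, 1 afterwards; ne(1) flags the
# rows that start a new run, and cumsum() turns those flags into run labels.
labels = days.diff().ne(1).cumsum()
print(labels.tolist())  # [1, 2, 2, 2]
```

Day 1 gets its own label, while days 5, 6, 7 share one, so counting rows per label gives the length of each run.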

For clarity, I added an additional e-mail at the end (foo@gmail.com) whose longest streak is three, on days 5, 6, 7.

print(df)

Email             Day
mala@gmail.com     1
vika@gmail.com     1
rupa@gmail.com     1
vika@gmail.com     2
vika@gmail.com     3
rupa@gmail.com     3
foo@gmail.com      1
foo@gmail.com      5
foo@gmail.com      6
foo@gmail.com      7

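# Drop repeated (Email, Day) rows and sort so consecutive days are adjacent
df1 = df.drop_duplicates(['Email', 'Day']).sort_values(['Email', 'Day'])
# Label each run of consecutive days: a new label starts where the gap is not 1
s1 = df1.groupby('Email').Day.diff().ne(1).cumsum()

# Size of each run, then the longest run per Email
df1.groupby(['Email', s1]).size().groupby('Email').max()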

Email
foo@gmail.com     3
mala@gmail.com    1
rupa@gmail.com    1
vika@gmail.com    3
dtype: int64
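
For anyone who wants to run this end to end, the same steps can be wrapped up as a self-contained sketch. The helper name `longest_streaks` and its `key`/`day` parameters are my own for illustration, not part of pandas, and the DataFrame is rebuilt by hand from the printed sample above:

```python
import pandas as pd


def longest_streaks(df, key="Email", day="Day"):
    """Longest run of consecutive days per key (hypothetical helper)."""
    # Remove repeated (key, day) rows and sort so diff() compares neighbours.
    d = df.drop_duplicates([key, day]).sort_values([key, day])
    # A new run starts wherever the gap to the previous day is not exactly 1.
    run_id = d.groupby(key)[day].diff().ne(1).cumsum()
    # Size of each run, then the largest run per key.
    return d.groupby([key, run_id]).size().groupby(level=0).max()


# Sample data reconstructed from the printed frame above.
df = pd.DataFrame({
    "Email": ["mala@gmail.com", "vika@gmail.com", "rupa@gmail.com",
              "vika@gmail.com", "vika@gmail.com", "rupa@gmail.com",
              "foo@gmail.com", "foo@gmail.com", "foo@gmail.com",
              "foo@gmail.com"],
    "Day": [1, 1, 1, 2, 3, 3, 1, 5, 6, 7],
})

streaks = longest_streaks(df)
print(streaks)
```

Note that this gives the longest streak anywhere in the data, not necessarily one that starts on day 1 or is still ongoing. If you only want streaks above some length, filter the result, e.g. `streaks[streaks >= 3]`.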
