Malavika Venkatesh
Malavika Venkatesh

Reputation: 49

How to find the streak using email ids in Pandas Python

I have a DataFrame that has students and the days they have attended their classes

Email             Day
[email protected]     1
[email protected]     1
[email protected]     1
[email protected]     2
[email protected]     3
[email protected]     3

Expected Output:

Email                 Streak
[email protected]          1
[email protected]          1
[email protected]          3

The result must be in such a way that only those who attended the classes in a streak like day1,day2,day3 must be printed

How can I do this using pandas?

Upvotes: 4

Views: 81

Answers (2)

ALollz
ALollz

Reputation: 59579

Here's one way that returns the length of the longest consecutive streak within each 'Email'.

First drop_duplicates that way repeated days for the same e-mail don't ruin any streaks, and sort. Then create labels for groups of consecutive days taking the cumsum of where the difference in days is not equal to 1. Finally group by the 'Email' and this group label and find the max size.

For clarity, I added an additional group at the end which has a streak of three on days 5,6,7.

print(df)

Email             Day
[email protected]     1
[email protected]     1
[email protected]     1
[email protected]     2
[email protected]     3
[email protected]     3
[email protected]      1
[email protected]      5
[email protected]      6
[email protected]      7

df1 = df.drop_duplicates(['Email', 'Day']).sort_values(['Email', 'Day'])
s1 = df1.groupby('Email').Day.diff().ne(1).cumsum()

df1.groupby(['Email', s1]).size().groupby('Email').max()

Email
[email protected]     3
[email protected]    1
[email protected]    1
[email protected]    3
dtype: int64

Upvotes: 4

Umar.H
Umar.H

Reputation: 23099

If I'm understanding your requirement correctly we can use a groupby.diff and .loc to filter your values.

df.loc[df.groupby('Email')['Day'].diff().fillna(1).eq(1)].groupby('Email')['Day'].idxmax()


df1 = df.loc[idx]


            Email  Day
0  [email protected]    1
2  [email protected]    1
4  [email protected]    3

stepping through each part of the code :

df.groupby('Email')['Day'].diff().fillna(1)

0    1.0
1    1.0
2    1.0
3    1.0
4    1.0
5    2.0 # < -- we don't want this. 
Name: Day, dtype: float64

Of the above values, we want the max day grouped by email returned as an index value.

print(idx)

Email
[email protected]    0 
[email protected]    2
[email protected]    4
Name: Day, dtype: int64


            Email  Day
0  [email protected]    1 # this row.
1  [email protected]    1
2  [email protected]    1 # this row.
3  [email protected]    2
4  [email protected]    3
5  [email protected]    3 # this row. 

if you want to filter out anyone who has no streaks, as in their greatest streak is one day, you could use

df.duplicated(subset=['Email'],keep=False)

Upvotes: 2

Related Questions