Naik

Reputation: 1255

Cumulative count in pandas

I have a data frame with several columns. One column holds the user name (it is not unique, so the same name can appear in many rows), and another, date_time, holds the date and time of each login. I want to add a new column with the cumulative count of logins for each user on each day. Specifically, for each row I want to know how many times the same user had already logged in earlier on the same day, before the current login.

Could you please help me with how I can write this in the most efficient way?

The input dataframe is:

user  date_time
JK    2016-01-15 09:27:00
KL    2016-02-10 12:17:00
JK    2016-01-15 11:07:00
JK    2016-01-15 12:17:10
KL    2016-02-10 12:47:00
JK    2017-11-15 08:27:00

The output dataframe is:

user  date_time               cum_count
Jk    2016-01-15 09:27:00     0    
KL    2016-02-10 12:17:00     0
Jk    2016-01-15 11:07:00     1
JK    2016-01-15 12:17:10     2
KL    2016-02-10 12:47:00     1
Jk    2017-11-15 08:27:00     0

Upvotes: 2

Views: 2944

Answers (3)

Valdi_Bo

Reputation: 30971

Pandas groupby objects provide a cumcount method, so the obvious solution is to use it.

Two notes about the grouping:

  • The user column appears in mixed upper and lower case, so group by the upper-cased user name rather than the raw value.
  • Group by the normalized date (the time part set to 00:00:00), not the full timestamp.

So the code can be:

df['cum_count'] = df.groupby([df.user.str.upper(), df.date_time.dt.normalize()])\
    .cumcount()
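
For anyone who wants to run this end to end, here is a minimal self-contained sketch. It assumes the sample rows from the question (with the mixed-case user names shown in the expected output) and that date_time is already a datetime column; the `keys` variable is just an illustrative name.

```python
import pandas as pd

# Sample data taken from the question; user names use the mixed case
# shown in the expected output
df = pd.DataFrame({
    "user": ["Jk", "KL", "Jk", "JK", "KL", "Jk"],
    "date_time": pd.to_datetime([
        "2016-01-15 09:27:00",
        "2016-02-10 12:17:00",
        "2016-01-15 11:07:00",
        "2016-01-15 12:17:10",
        "2016-02-10 12:47:00",
        "2017-11-15 08:27:00",
    ]),
})

# Group by case-insensitive user name and by calendar day
keys = [df["user"].str.upper(), df["date_time"].dt.normalize()]

# cumcount numbers the rows of each group 0, 1, 2, ... in row order
df["cum_count"] = df.groupby(keys).cumcount()
print(df)
```

With this sample the new column comes out as 0, 0, 1, 2, 1, 0, which matches the output requested in the question.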

Upvotes: 1

ansev

Reputation: 30920

Use GroupBy.cumcount

df['cum_count'] = df.groupby(['user', df['date_time'].dt.date]).cumcount()
# if necessary, convert to datetime first
#df['cum_count'] = df.groupby(['user',
#                              pd.to_datetime(df['date_time']).dt.date]).cumcount()


  user           date_time  cum_count
0   Jk 2016-01-15 09:27:00          0
1   KL 2016-02-10 12:17:00          0
2   Jk 2016-01-15 11:07:00          1
3   JK 2016-01-15 12:17:10          0
4   KL 2016-02-10 12:47:00          1
5   Jk 2017-11-15 08:27:00          0

Use Series.str.lower if you don't want to distinguish between upper and lower case in the user names:

df['cum_count'] = df.groupby([df['user'].str.lower(),
                              df['date_time'].dt.date]).cumcount()

  user           date_time  cum_count
0   Jk 2016-01-15 09:27:00          0
1   KL 2016-02-10 12:17:00          0
2   Jk 2016-01-15 11:07:00          1
3   JK 2016-01-15 12:17:10          2
4   KL 2016-02-10 12:47:00          1
5   Jk 2017-11-15 08:27:00          0

Upvotes: 1

BENY

Reputation: 323226

Let us do groupby + cumcount

df['cum_count'] = df.groupby([df.user, df.date_time.dt.date]).cumcount()
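
One caveat: cumcount numbers rows in the order they appear in the frame, not by timestamp. If the rows are not already in chronological order within each user and day, one option is to sort first and rely on index alignment to put the counts back. This is only a sketch, assuming the column names from the question and a hypothetical shuffled sample; `s` and `counts` are illustrative names.

```python
import pandas as pd

# Hypothetical sample: same user and day, rows deliberately out of time order
df = pd.DataFrame({
    "user": ["JK", "JK", "JK", "KL"],
    "date_time": pd.to_datetime([
        "2016-01-15 11:07:00",
        "2016-01-15 09:27:00",
        "2016-01-15 12:17:10",
        "2016-02-10 12:17:00",
    ]),
})

# Sort by timestamp so cumcount runs chronologically within each group
s = df.sort_values("date_time")
counts = s.groupby([s["user"], s["date_time"].dt.date]).cumcount()

# counts keeps the original index labels, so the assignment
# aligns each value back to its original row in df
df["cum_count"] = counts
print(df)
```

Here the 09:27 login gets 0, the 11:07 login gets 1 and the 12:17:10 login gets 2, even though they are not stored in that order.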

Upvotes: 4
