Luke

Reputation: 7089

Counting entries by sub-categories and date in pandas

I have a dataframe and I'm trying to count the number of people who have joined each group up to and including each date. So this:

individual_id   group_id     date  
   a              1       2000-01-01  
   a              1       2000-01-02  
   a              1       2000-01-03  
   b              1       2000-01-02  
   b              1       2000-01-04  
   c              1       2000-01-03  
   c              1       2000-01-04  
   d              2       2000-01-02  

Would become this:

individual_id   group_id     date      people_in_group
   a              1       2000-01-01         1
   a              1       2000-01-02         2
   a              1       2000-01-03         3
   b              1       2000-01-02         2
   b              1       2000-01-04         3
   c              1       2000-01-03         3
   c              1       2000-01-04         3
   d              2       2000-01-02         1

Upvotes: 0

Views: 1089

Answers (1)

J Richard Snape

Reputation: 20344

First, you can use groupby to count the rows for each individual, group and date, i.e.

import pandas as pd
from datetime import datetime

# Mock-up of the original data. Integer literals with leading zeros
# (e.g. 01) are a syntax error in Python 3, so plain 1, 2, ... are used.
df = pd.DataFrame({'individual_id':['a','a','a','b','b','c','c','d'],
                   'group_id':[1,1,1,1,1,1,1,2],
                   'date':[datetime(2000,1,1),datetime(2000,1,2),
                           datetime(2000,1,3),datetime(2000,1,5),
                           datetime(2000,1,6),datetime(2000,1,3),
                           datetime(2000,1,4),datetime(2000,1,2)]})

#df = <dataframe of your original data (mocked up above)>
#Add a placeholder 'rowCounter' column, so that the groups are easily counted.
df['rowCounter'] = 1
#Sum the counter for each (individual_id, group_id, date) combination.
df1  = df.groupby(['individual_id','group_id','date'], as_index=False).sum()

Then use cumsum() to keep a running total within each individual and group, up to and including each date:

df1['people_in_group'] = df1.groupby(['individual_id','group_id'])['rowCounter'].cumsum()
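
As an aside, the same running count can be sketched without the placeholder column by using cumcount(). This is a minimal sketch, not part of the original answer, and it assumes each (individual_id, group_id, date) combination appears only once; the name running is illustrative.

```python
import pandas as pd

# Same mock data as above, built from date strings for brevity.
df = pd.DataFrame({
    'individual_id': ['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd'],
    'group_id': [1, 1, 1, 1, 1, 1, 1, 2],
    'date': pd.to_datetime(['2000-01-01', '2000-01-02', '2000-01-03',
                            '2000-01-05', '2000-01-06', '2000-01-03',
                            '2000-01-04', '2000-01-02']),
})

# Sort so that rows are in date order within each individual and group.
df = df.sort_values(['individual_id', 'group_id', 'date'])

# cumcount() numbers the rows of each group from 0, so add 1.
running = df.groupby(['individual_id', 'group_id']).cumcount() + 1
df['people_in_group'] = running
```

If a combination can repeat, the two approaches differ: the summed rowCounter collapses repeats into one row, while cumcount() numbers every row.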

Optionally, remove the dummy row counter column we created:

df1 = df1.drop(columns='rowCounter')

Printing df1 now shows:

  individual_id  group_id       date  people_in_group
0             a         1 2000-01-01                1
1             a         1 2000-01-02                2
2             a         1 2000-01-03                3
3             b         1 2000-01-05                1
4             b         1 2000-01-06                2
5             c         1 2000-01-03                1
6             c         1 2000-01-04                2
7             d         2 2000-01-02                1
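
Note that this output is a running count of rows per individual, so it does not reproduce the people_in_group column asked for in the question. One possible way to get that column is sketched below, assuming "joined" means an individual's earliest date within a group, and using the dates from the question rather than the mock-up above. The names first_seen, joiners and per_date are illustrative and not from the original answer.

```python
import pandas as pd

# Data exactly as given in the question.
df = pd.DataFrame({
    'individual_id': ['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd'],
    'group_id': [1, 1, 1, 1, 1, 1, 1, 2],
    'date': pd.to_datetime(['2000-01-01', '2000-01-02', '2000-01-03',
                            '2000-01-02', '2000-01-04', '2000-01-03',
                            '2000-01-04', '2000-01-02']),
})

# Earliest date on which each individual appears in each group.
first_seen = df.groupby(['group_id', 'individual_id'], as_index=False)['date'].min()

# Number of individuals whose first appearance falls on each (group, date).
joiners = (first_seen.groupby(['group_id', 'date'])
           .size()
           .reset_index(name='new_members'))

# Every (group, date) pair present in the data; dates with no new joiner get 0.
per_date = (df[['group_id', 'date']].drop_duplicates()
            .merge(joiners, on=['group_id', 'date'], how='left'))
per_date['new_members'] = per_date['new_members'].fillna(0)

# Running total of joiners within each group, in date order.
per_date = per_date.sort_values(['group_id', 'date'])
per_date['people_in_group'] = (per_date.groupby('group_id')['new_members']
                               .cumsum()
                               .astype(int))

# Attach the running total to the original rows, keeping their order.
result = df.merge(per_date[['group_id', 'date', 'people_in_group']],
                  on=['group_id', 'date'], how='left')
```

Here result['people_in_group'] works out to 1, 2, 3, 2, 3, 3, 3, 1, which matches the table requested in the question.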

Upvotes: 1
