Reputation: 7089
I have a dataframe and I'm trying to count, for each date, the number of distinct people who have joined a group up to and including that date. So this:
individual_id group_id date
a 1 2000-01-01
a 1 2000-01-02
a 1 2000-01-03
b 1 2000-01-02
b 1 2000-01-04
c 1 2000-01-03
c 1 2000-01-04
d 2 2000-01-02
Would become this:
individual_id group_id date people_in_group
a 1 2000-01-01 1
a 1 2000-01-02 2
a 1 2000-01-03 3
b 1 2000-01-02 2
b 1 2000-01-04 3
c 1 2000-01-03 3
c 1 2000-01-04 3
d 2 2000-01-02 1
Upvotes: 0
Views: 1089
Reputation: 20344
First, you can use groupby to count the rows for each combination of individual, group and date:
import pandas as pd
from datetime import datetime

# Mock-up of the original data; replace this with your own dataframe.
# Note: integer literals with leading zeros (e.g. 01) are a syntax error in Python 3.
df = pd.DataFrame({'individual_id': ['a','a','a','b','b','c','c','d'],
                   'group_id': [1,1,1,1,1,1,1,2],
                   'date': [datetime(2000,1,1), datetime(2000,1,2),
                            datetime(2000,1,3), datetime(2000,1,5),
                            datetime(2000,1,6), datetime(2000,1,3),
                            datetime(2000,1,4), datetime(2000,1,2)]})
# Add a placeholder 'rowCounter' column of integer ones, so that the groups are easily counted.
df['rowCounter'] = 1
# Sum the ones within each (individual_id, group_id, date) combination.
df1 = df.groupby(['individual_id','group_id','date'], as_index=False).sum()
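As an aside, the placeholder column is not strictly needed: groupby's size() counts the rows in each group directly. A minimal sketch, using a tiny made-up dataframe (not the data above) in which one row is repeated so the count is visible:

```python
import pandas as pd

# Hypothetical three-row sample; 'a' appears twice on the same date.
df = pd.DataFrame({
    'individual_id': ['a', 'a', 'b'],
    'group_id': [1, 1, 1],
    'date': pd.to_datetime(['2000-01-01', '2000-01-01', '2000-01-02']),
})

# size() returns the number of rows per group; reset_index(name=...)
# turns the result back into a dataframe with a named count column.
counts = (df.groupby(['individual_id', 'group_id', 'date'])
            .size()
            .reset_index(name='rowCounter'))
print(counts)
```

Either route should give the same rowCounter values; size() just avoids adding and later dropping a helper column.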
Then use the cumsum() function to keep a running total, within each individual_id and group_id pair, up to and including each date:
df1['people_in_group'] = df1.groupby(['individual_id','group_id'])['rowCounter'].cumsum()
Optionally, remove the dummy row counter column we created:
df1 = df1.drop(columns='rowCounter')
Printing df1 now shows:
individual_id group_id date people_in_group
0 a 1 2000-01-01 1
1 a 1 2000-01-02 2
2 a 1 2000-01-03 3
3 b 1 2000-01-05 1
4 b 1 2000-01-06 2
5 c 1 2000-01-03 1
6 c 1 2000-01-04 2
7 d 2 2000-01-02 1
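Note that the running total in this output is taken per individual_id and group_id, so it counts each person's own rows rather than the distinct people in the group. If the goal is the table from the question, one possible approach is sketched below on the question's sample data. It assumes that a person joins a group on their earliest date in that group and that each (individual_id, group_id, date) row appears only once; joined_today, per_day and running are names made up for this illustration.

```python
import pandas as pd

# Sample data as given in the question.
df = pd.DataFrame({
    'individual_id': ['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd'],
    'group_id': [1, 1, 1, 1, 1, 1, 1, 2],
    'date': pd.to_datetime(['2000-01-01', '2000-01-02', '2000-01-03',
                            '2000-01-02', '2000-01-04', '2000-01-03',
                            '2000-01-04', '2000-01-02']),
})

# Assumption: the earliest date per (group, individual) is the join date.
first_date = df.groupby(['group_id', 'individual_id'])['date'].transform('min')

# Flag the row on which each person first appears in the group.
df['joined_today'] = (df['date'] == first_date).astype(int)

# Joins per group and date (0 on dates where nobody new appears),
# then a running total within each group. groupby sorts by date here.
per_day = df.groupby(['group_id', 'date'])['joined_today'].sum()
running = per_day.groupby(level='group_id').cumsum().rename('people_in_group')

# Attach the running total to every original row.
result = df.drop(columns='joined_today').join(running, on=['group_id', 'date'])
print(result)
```

On this sample, people_in_group should match the column in the question. If rows can be duplicated, dropping duplicates on the three key columns first would keep the join-day flag from being counted twice.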
Upvotes: 1