Reputation: 316
TL;DR
I'd like to group by subject and by 30-day time period, but the 30-day periods aren't anchored separately for each subject.
What's the best way to handle this?
Full explanation
I've got a sample of participants who all started a scientific study at different times. I'd like to use TimeGrouper to segment by each 30 day period following their first day in the study.
After some searching, it seems like that might not be possible, because it's tough to specify a starting point for TimeGrouper. So, as a proxy, I could potentially use the first observed timestamp for each person.
To that end, I tried grouping by participant ID and TimeGrouper, but that 30 day period seems to begin counting from the earliest global time point, rather than the earliest time point for each participant.
I know this is a little complicated, so here is some code:
This is a fake dataframe that represents the type of data I'm working with:
import pandas as pd
fakedf = pd.DataFrame({
    'participantID': ['subj1', 'subj1', 'subj1', 'subj1', 'subj2', 'subj2', 'subj2', 'subj2'],
    'timestamp': ['2015-06-25 01:12:00', '2015-06-30 11:02:00', '2015-07-05 09:33:00', '2015-07-28 07:22:00',
                  '2015-07-25 01:11:00', '2015-07-31 11:02:00', '2015-08-07 09:33:00', '2015-08-10 07:22:00'],
    'studystart': ['2015-06-20 00:00:00', '2015-06-20 00:00:00', '2015-06-20 00:00:00', '2015-06-20 00:00:00',
                   '2015-07-25 00:00:00', '2015-07-25 00:00:00', '2015-07-25 00:00:00', '2015-07-25 00:00:00'],
})
# Index by the observation timestamps so TimeGrouper can bin on the index
fakedf.index = pd.to_datetime(fakedf.timestamp)
The code above should create this data frame:
Below is the code that I wish actually worked:
fakedf.groupby(['participantID', pd.TimeGrouper(freq="30D", closed='left')]).count()
Here's the output:
You can see that subj1 and subj2 start their timegroupings at 2015-06-25, despite the fact that subj2 doesn't have a real timestamp until 2015-07-25.
I would be happy if I could start each 30 day timegrouping by either:
a) The study start date, or
b) The first timestamp per participant
I have a low-tech solution that I know will work, but I was hoping for a nice, elegant TimeGrouper solution.
Thanks in advance!
Upvotes: 2
Views: 285
Reputation: 33803
To get the TimeGrouper to work at the participant level, first do a groupby on 'participantID', and then, within each group, do another groupby on the TimeGrouper. For clarity, I've split the second groupby out into a separate function.
def inner_groupby(grp, key=None):
    # Bin one participant's rows into 30-day periods (on the index unless a key column is given)
    return grp.groupby(pd.TimeGrouper(key=key, freq='30D')).count()

fakedf.groupby('participantID').apply(inner_groupby)
The resulting output:
                                   participantID  studystart  timestamp
participantID timestamp
subj1         2015-06-25 01:12:00              3           3          3
              2015-07-25 01:12:00              1           1          1
subj2         2015-07-25 01:11:00              4           4          4
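On newer pandas releases, pd.TimeGrouper is no longer available (it was deprecated in favor of pd.Grouper and later removed). Here is a minimal sketch of the same per-participant idea using pd.Grouper. The origin='start' argument is my own addition, assuming pandas 1.1 or later, to anchor each participant's bins at their own first timestamp. The explicit loop and the name per_participant are illustrative choices, not part of the code above.

```python
import pandas as pd

fakedf = pd.DataFrame({
    'participantID': ['subj1'] * 4 + ['subj2'] * 4,
    'timestamp': ['2015-06-25 01:12:00', '2015-06-30 11:02:00',
                  '2015-07-05 09:33:00', '2015-07-28 07:22:00',
                  '2015-07-25 01:11:00', '2015-07-31 11:02:00',
                  '2015-08-07 09:33:00', '2015-08-10 07:22:00'],
})
fakedf.index = pd.to_datetime(fakedf['timestamp'])

# Assumption: pandas >= 1.1, where pd.Grouper accepts `origin`.
# origin='start' begins the 30-day bins at each group's first timestamp.
per_participant = {
    pid: grp.groupby(pd.Grouper(freq='30D', origin='start'))['timestamp'].count()
    for pid, grp in fakedf.groupby('participantID')
}

# Stack the per-participant counts into one Series with a two-level index
result = pd.concat(per_participant, names=['participantID', 'timestamp'])
print(result)
```

Each participant is binned separately, so subj2's first bin starts at subj2's own first observation rather than at the earliest timestamp in the whole data set.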
You don't need to specify a key for TimeGrouper. By default, I believe it will use the index. However, if you want the TimeGrouper to operate on another column, like 'studystart', pass it via the key parameter:
fakedf.groupby('participantID').apply(inner_groupby, key='studystart')
And the resulting output for key='studystart':
                          participantID  timestamp
participantID studystart
subj1         2015-06-20              4          4
subj2         2015-07-25              4          4
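Note that with key='studystart' the bins are built from the study-start values themselves, which are identical within each participant, so every row lands in a single bin. If the goal is option (a) from the question, counting observations per 30-day window measured from the study start, one possible sketch is to compute the window number with plain arithmetic. The column name period and the variable counts are hypothetical names introduced for this example, and the string columns are converted to datetimes first.

```python
import pandas as pd

fakedf = pd.DataFrame({
    'participantID': ['subj1'] * 4 + ['subj2'] * 4,
    'timestamp': ['2015-06-25 01:12:00', '2015-06-30 11:02:00',
                  '2015-07-05 09:33:00', '2015-07-28 07:22:00',
                  '2015-07-25 01:11:00', '2015-07-31 11:02:00',
                  '2015-08-07 09:33:00', '2015-08-10 07:22:00'],
    'studystart': ['2015-06-20 00:00:00'] * 4 + ['2015-07-25 00:00:00'] * 4,
})

# Convert the string columns to datetimes before doing date arithmetic
ts = pd.to_datetime(fakedf['timestamp'])
start = pd.to_datetime(fakedf['studystart'])

# Whole days since each participant's study start, bucketed into 30-day windows
# (0 = first 30 days, 1 = days 30-59, and so on)
fakedf['period'] = (ts - start).dt.days // 30

# Count observations per participant and per 30-day window
counts = fakedf.groupby(['participantID', 'period']).size()
print(counts)
```

To anchor on the first observed timestamp instead (option b), the same arithmetic should work with start replaced by ts.groupby(fakedf['participantID']).transform('min').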
Upvotes: 2