Reid

Reputation: 316

PANDAS TimeGrouper with individualized starting point for downsample

TL;DR

I'd like to group by subject and 30-day time period, but the 30-day periods aren't individualized by subject: they all count from the same global starting point.

What's the best way to handle this?

Full explanation

I've got a sample of participants who all started a scientific study at different times. I'd like to use TimeGrouper to segment by each 30 day period following their first day in the study.

After some searching, it seems like that might not be possible, because it's tough to specify a starting point for TimeGrouper. So, as a proxy, I could potentially use the first observed timestamp for each person.

To that end, I tried grouping by participant ID and TimeGrouper, but the 30-day period seems to begin counting from the earliest global time point, rather than the earliest time point for each participant.

I know this is a little complicated, so here is some code:

This is a fake dataframe that represents the type of data I'm working with:

import pandas as pd

fakedf = pd.DataFrame({
    'participantID': ['subj1', 'subj1', 'subj1', 'subj1',
                      'subj2', 'subj2', 'subj2', 'subj2'],
    'timestamp': ['2015-06-25 01:12:00', '2015-06-30 11:02:00',
                  '2015-07-05 09:33:00', '2015-07-28 07:22:00',
                  '2015-07-25 01:11:00', '2015-07-31 11:02:00',
                  '2015-08-07 09:33:00', '2015-08-10 07:22:00'],
    'studystart': ['2015-06-20 00:00:00', '2015-06-20 00:00:00',
                   '2015-06-20 00:00:00', '2015-06-20 00:00:00',
                   '2015-07-25 00:00:00', '2015-07-25 00:00:00',
                   '2015-07-25 00:00:00', '2015-07-25 00:00:00'],
})

fakedf.index = pd.to_datetime(fakedf.timestamp)

Above code should create this data frame:

[image: fake data frame]

Below is the code that I wish actually worked:

fakedf.groupby(['participantID', pd.TimeGrouper(freq="30D",  closed='left')]).count()

Here's the output:


You can see that subj1 and subj2 start their timegroupings at 2015-06-25, despite the fact that subj2 doesn't have a real timestamp until 2015-07-25.

I would be happy if I could start each 30 day timegrouping by either:

a) The study start date, or

b) The first timestamp per participant

I have a low-tech solution that I know will work, but I was hoping for a nice, elegant TimeGrouper solution.

Thanks in advance!

Upvotes: 2

Views: 285

Answers (1)

root

Reputation: 33803

To make the TimeGrouper operate at the participant level, first do a groupby on 'participantID', then within each group do another groupby on the TimeGrouper. For the sake of clarity, I've pulled the second groupby out into its own function.

def inner_groupby(grp, key=None):
    return grp.groupby(pd.TimeGrouper(key=key, freq='30D')).count()

fakedf.groupby('participantID').apply(inner_groupby)

The resulting output:

                                   participantID  studystart  timestamp
participantID timestamp                                                
subj1         2015-06-25 01:12:00              3           3          3
              2015-07-25 01:12:00              1           1          1
subj2         2015-07-25 01:11:00              4           4          4
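
pd.TimeGrouper was deprecated and later removed from pandas in favor of pd.Grouper, so on a current install the same idea might look like the sketch below. It is a minimal illustration rather than a definitive port: it assumes a pandas version in which pd.Grouper accepts the origin argument, uses origin='start' so that each participant's bins are anchored at that participant's first timestamp, and counts rows with size(). The names pieces and counts are made up for the example.

```python
import pandas as pd

# Same participants and timestamps as fakedf in the question.
fakedf = pd.DataFrame({
    'participantID': ['subj1'] * 4 + ['subj2'] * 4,
    'timestamp': pd.to_datetime([
        '2015-06-25 01:12:00', '2015-06-30 11:02:00',
        '2015-07-05 09:33:00', '2015-07-28 07:22:00',
        '2015-07-25 01:11:00', '2015-07-31 11:02:00',
        '2015-08-07 09:33:00', '2015-08-10 07:22:00',
    ]),
}).set_index('timestamp')

# Outer loop: one group per participant.
# Inner groupby: 30-day bins over that participant's own DatetimeIndex.
# origin='start' (an assumption about the installed pandas version)
# anchors the bins at the group's first timestamp.
pieces = {
    pid: grp.groupby(pd.Grouper(freq='30D', origin='start')).size()
    for pid, grp in fakedf.groupby('participantID')
}

# Stack the per-participant results; the dict keys become the outer index level.
counts = pd.concat(pieces)
print(counts)
```

Looping over the groups and concatenating sidesteps the differences in how groupby(...).apply treats the grouping column across pandas versions; the nested-groupby idea is otherwise unchanged.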

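You don't need to specify a key for TimeGrouper; by default, I believe it will use the index. However, if you want the TimeGrouper to operate on another column, like 'studystart', you'd pass it via the key parameter. That column has to hold datetime values, so convert it with pd.to_datetime first (in fakedf it is created as strings):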

fakedf.groupby('participantID').apply(inner_groupby, key='studystart')

And the resulting output for key='studystart':

                          participantID  timestamp
participantID studystart                          
subj1         2015-06-20              4          4
subj2         2015-07-25              4          4
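
Note that in the key='studystart' output, every row of a participant lands in a single bin, because 'studystart' is constant within each participant. If the goal is option (a) from the question, that is, 30-day periods of the observation timestamps counted from each participant's study start, one possible sketch is to compute a period number by plain date arithmetic. This is only an illustration under the assumption that timestamps never precede the study start; the column name period and the variable counts are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    'participantID': ['subj1'] * 4 + ['subj2'] * 4,
    'timestamp': pd.to_datetime([
        '2015-06-25 01:12:00', '2015-06-30 11:02:00',
        '2015-07-05 09:33:00', '2015-07-28 07:22:00',
        '2015-07-25 01:11:00', '2015-07-31 11:02:00',
        '2015-08-07 09:33:00', '2015-08-10 07:22:00',
    ]),
    'studystart': pd.to_datetime(['2015-06-20'] * 4 + ['2015-07-25'] * 4),
})

# Whole days elapsed since each row's own study start,
# integer-divided into consecutive 30-day periods (0, 1, 2, ...).
df['period'] = (df['timestamp'] - df['studystart']).dt.days // 30

# Count observations per participant and per individualized period.
counts = df.groupby(['participantID', 'period']).size()
print(counts)
```

Here period 0 covers days 0 through 29 after the study start, period 1 covers days 30 through 59, and so on, independently for each participant.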

Upvotes: 2
