ajl123

Reputation: 1272

Computing rate of occurrences per unit of time in a pandas dataframe

I am trying to compute the rate at which an event occurs in a dataframe.

I have a dataframe in pandas that tracks the onset and duration of a certain event. Initially it looks something along these lines:

   onset  duration label channels  end_time
0    1.5       0.1   HFO       A1  10
1    2.0       1.0   HFO       A2  10
2    3.0       1.0   HFO       A3  10
3    5.5       0.1   HFO       A1  10

where onset, duration, and end_time are in seconds; channels denotes the set of groups that I want to loop over.

Desired Output

I would like to get something like this:

rate_dict = {
  'A1': 0.2,  # rate of 0.2 per second (i.e. 2 occurrences over a 10 second time frame)
  'A2': 0.1,  # rate of 0.1 per second
  'A3': 0.1
}

My attempt so far

First, I obtain a group for each channel:

for idx, group in df.groupby('channels'):

Then I convert the group to use a datetime index:

                                  onset  duration label channels  end_time
timestamp
2021-02-10 19:25:19.391130+00:00    1.5       0.1   HFO       A1        10
2021-02-10 19:25:23.391130+00:00    5.5       0.1   HFO       A1        10
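
For reference, the index above can be built by offsetting each onset from a recording start time. A minimal sketch (ref_timestamp is a hypothetical start time chosen to reproduce the timestamps shown):

import pandas as pd

# hypothetical recording start; adding the 1.5 s and 5.5 s onsets
# reproduces the timestamps shown above
ref_timestamp = pd.Timestamp('2021-02-10 19:25:17.891130+00:00')
group = group.set_index(ref_timestamp + pd.to_timedelta(group['onset'], unit='s'))
group.index.name = 'timestamp'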

Next, I considered re-indexing over the range from the start time (0 seconds) to the end time (10 seconds in this case):

import numpy as np

# rate is, e.g., 's' to create one dummy row per second
rate = 's'
end_timestamp = ref_timestamp + pd.Timedelta(seconds=10)
dt_idx = pd.date_range(ref_timestamp, end_timestamp, freq=rate)
group = group.reindex(dt_idx, fill_value=np.nan)

The issue is that reindexing does not pick up the events at 1.5 and 5.5 seconds for channel A1, because those onsets do not fall exactly on the new one-second index labels. So I end up with rows of all NaNs, whereas ideally I would get a count of 2 over the resampled period.
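
For illustration, a minimal sketch of the behaviour I am after for channel A1, binning the onsets with resample instead of reindexing, so that the off-grid events at 1.5 and 5.5 seconds are still counted:

# starting again from the timestamp-indexed group, before the reindex
counts = group['onset'].resample('1s').count()

# 2 events over the 10 second window -> 0.2 events per second
rate_per_s = counts.sum() / group['end_time'].iloc[0]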

Desired Generalization

Ideally, I could specify other rate strings (e.g. 'hr') and it would return the rate per hour. In this case the 10 second recording falls within a single one-hour frame, so that would be:

rate_dict = {
      'A1': 2.0,  # rate of 2 per hr (i.e. 2 occurrences over a 1 hour time frame)
      'A2': 1.0,  # rate of 1 per hr
      'A3': 1.0
    }

Upvotes: 3

Views: 866

Answers (2)

Arthur D.

Reputation: 410

First, we can recreate your table as a Pandas DataFrame:

import pandas as pd
d = {'onset': [1.5, 2.0, 3.0, 5.5], 
     'duration': [0.1, 1.0, 1.0, 0.1],
     'label': ['HFO', 'HFO', 'HFO', 'HFO'],
     'channels': ['A1', 'A2', 'A3', 'A1'],
     'end_time': [10.0, 10.0, 10.0, 10.0]}

df = pd.DataFrame(d)

To answer your question directly in terms of occurrences per second, we can count the occurrences and divide by the mean end_time:

df.groupby('channels').end_time.agg(lambda x: x.count()/x.mean()).to_dict()
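
On the example frame this returns {'A1': 0.2, 'A2': 0.1, 'A3': 0.1}, matching your desired per-second output.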

To generalize this, let's create a function to_freq that takes as input both the series x and the desired rate as a string rate:

def to_freq(x, rate='s'):
    # seconds in one unit of the requested rate
    d = {'s': 1, 'm': 60, 'h': 60 * 60, 'd': 60 * 60 * 24}
    # window length expressed in the requested unit; a recording shorter
    # than one unit counts as a single full unit, so the 10 s window is
    # treated as one hour frame, as in the question's desired output
    window = max(x.mean() / d[rate], 1)
    return x.count() / window

Now, our original code becomes:

df.groupby('channels').end_time.agg(lambda x: to_freq(x)).to_dict()

And we can find the occurrences-per-hour as follows:

df.groupby('channels').end_time.agg(lambda x: to_freq(x, rate='h')).to_dict()
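
On the example frame this returns {'A1': 2.0, 'A2': 1.0, 'A3': 1.0}: the 10 second window is shorter than one hour, so it counts as a single hour frame, matching your desired generalization.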

Upvotes: 1

wwnde

Reputation: 26676

Group by channels, count the occurrences, and divide by the mean of end_time. Each row is one occurrence, so the count is just the group size, and since end_time hardly varies within a group, taking its mean simply recovers the recording length.

df.groupby('channels')['end_time'].agg(lambda x: x.count()/x.mean()).to_dict()

Upvotes: 0
