Reputation: 1272
I am seeking to compute the rate of an event occurrence in a dataframe.
I have a dataframe in pandas that tracks the onset and duration of a certain event. Initially it will look something along the lines of:
   onset  duration label channels  end_time
0    1.5       0.1   HFO       A1        10
1    2.0       1.0   HFO       A2        10
2    3.0       1.0   HFO       A3        10
3    5.5       0.1   HFO       A1        10
where onset, duration, and end_time are in seconds, and channels denotes a unique set of groups that I want to loop over.
I would like to get something like this:
rate_dict = {
    'A1': 0.2,  # rate of 0.2 per second (i.e. 2 occurrences over a 10 second time frame)
    'A2': 0.1,  # rate of 0.1 per second
    'A3': 0.1
}
First, I obtain a group based on channels:
for idx, group in df.groupby(['channels']):
Then I convert things to a datetime index:
                                  onset  duration label channels  end_time
timestamp
2021-02-10 19:25:19.391130+00:00    1.5       0.1   HFO       A1        10
2021-02-10 19:25:23.391130+00:00    5.5       0.1   HFO       A1        10
Next, I considered re-indexing over the start (0 seconds) and end time (10 seconds in this case):
# rate is, say, 's' to create a dummy row for every second
dt_idx = pd.date_range(ref_timestamp, end_timestamp, freq=rate)
group = group.reindex(dt_idx, fill_value=np.nan)
The issue is that it does not pick up the events that occur at 1.5 and 5.5 seconds for channel A1, so I end up with rows that are all NaNs, whereas ideally I would get a count of 2 over the resampled period.
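Here is a minimal reproduction of what I am seeing (the timestamps are made up for illustration). As far as I can tell, reindex only keeps labels that match the new index exactly, and the fractional-second event timestamps never match the whole-second index; flooring the index to whole seconds first makes them line up:
import pandas as pd

# made-up reference timestamps for illustration
ref_timestamp = pd.Timestamp('2021-02-10 19:25:19', tz='UTC')
end_timestamp = ref_timestamp + pd.Timedelta(seconds=10)

# the two A1 events, stamped 1.5s and 5.5s after the reference
group = pd.DataFrame(
    {'onset': [1.5, 5.5]},
    index=[ref_timestamp + pd.Timedelta(seconds=s) for s in (1.5, 5.5)])

dt_idx = pd.date_range(ref_timestamp, end_timestamp, freq='s')
print(group.reindex(dt_idx))  # every row is NaN: no exact index match

# flooring the event timestamps to whole seconds makes them line up
floored = group.set_index(group.index.floor('s'))
print(floored.reindex(dt_idx)['onset'].count())  # 2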
Ideally, I could specify other rate strings (e.g. 'hr') and it would return the rate per hour. In this case that would be:
rate_dict = {
    'A1': 2.0,  # rate of 2 per hour (i.e. 2 occurrences over a 1 hour time frame)
    'A2': 1.0,  # rate of 1 per hour
    'A3': 1.0
}
Upvotes: 3
Views: 866
Reputation: 410
First, we can recreate your table as a Pandas DataFrame:
import pandas as pd
d = {'onset': [1.5, 2.0, 3.0, 5.5],
     'duration': [0.1, 1.0, 1.0, 0.1],
     'label': ['HFO', 'HFO', 'HFO', 'HFO'],
     'channels': ['A1', 'A2', 'A3', 'A1'],
     'end_time': [10.0, 10.0, 10.0, 10.0]}
df = pd.DataFrame(d)
To solve your question directly, in terms of occurrences per second, we can count the occurrences and divide by the mean end_time:
df.groupby('channels').end_time.agg(lambda x: x.count()/x.mean()).to_dict()
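For the example data, this returns {'A1': 0.2, 'A2': 0.1, 'A3': 0.1}, which matches the per-second dictionary in the question.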
To generalize this, let's create a function to_freq that takes as input both the series x and the desired rate string rate:
import numpy as np

def to_freq(x, rate='s'):
    # seconds per unit of the requested rate string
    d = {'s': 1, 'm': 60, 'h': 60 * 60, 'd': 60 * 60 * 24}
    # count of events divided by the number of (whole or partial)
    # units of `rate` spanned by the recording window
    return x.count() / np.ceil(x.mean() / d[rate])
Now, our original code becomes:
df.groupby('channels').end_time.agg(lambda x: to_freq(x)).to_dict()
And we can find the occurrences-per-hour as follows:
df.groupby('channels').end_time.agg(lambda x: to_freq(x, rate='h')).to_dict()
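For the example data, this returns {'A1': 2.0, 'A2': 1.0, 'A3': 1.0}, matching the desired per-hour dictionary: the 10 second recording spans a single (partial) hour, so each channel's count is divided by 1.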
Upvotes: 1
Reputation: 26676
Group by channels, count the occurrences, and divide by the mean of end_time. The number of occurrences per group is simply the group size, and since end_time hardly varies within a group, there is no harm in using its mean as the window length here.
df.groupby('channels')['end_time'].agg(lambda x: x.count()/x.mean()).to_dict()
Upvotes: 0