Reputation: 4897
Given a dataframe of intervals ('start time' and 'end time'), what would be an efficiant way to generate a histogram that show for each time point T, how many intervals are 'active' in it?
Code to generate a sample dataframe (apologies, I am sure there is a better way to do this):
from random import randint
intervals = []
for i in range(50):
start = randint(0,50)
intervals.append({'start':start, 'end':start+randint(0,50)})
intervals_df = pd.DataFrame(intervals)
An example:
For the following dataframe:
import pandas as pd
pd.DataFrame([{'start':2,'end':5},{'start':3,'end':8},{'start':9,'end':10},{'start':4,'end':5}])
The matching graph would be similar to:
My intuition is that the time intervals need to be broken somehow to discrete values so they could be binned, but how?
Upvotes: 1
Views: 1551
Reputation: 21
chrisb's answer is great, but the reason why you have a gap between 5 and 6 is because the default number of bins is 10 by default. This means that the bins are distant by 0.8 in your case, so the gap is actually between 5.2 and 6.
A way to counteract this is to manually set the number of bins equal to the length of your data:
nb_bins = int(max(expanded) - min(expanded))
expanded.hist(bins = nb_bins)
Result:
1
Upvotes: 1
Reputation: 52286
You could use apply
to generate the values for each range, then melt
to reshape the data into long form.
In [113]: expanded = df.apply(lambda row: pd.Series(np.arange(row['start'], row['end'] + 1)), axis=1)
In [114]: expanded
Out[114]:
0 1 2 3 4 5
0 2 3 4 5 NaN NaN
1 3 4 5 6 7 8
2 9 10 NaN NaN NaN NaN
3 4 5 NaN NaN NaN NaN
In [115]: expanded = pd.melt(expanded)['value'].dropna()
In [116]: expanded
Out[116]:
0 2
1 3
2 9
3 4
4 3
5 4
6 10
7 5
8 4
9 5
12 5
13 6
17 7
21 8
Name: value, dtype: float64
From there you could use the built-in histogram plot, or some other kind of binning.
In [117]: expanded.hist()
Upvotes: 3