Reputation: 16415
I have this dataset of active subjects during specified time-periods.
start end name
0 00:00 00:10 a
1 00:10 00:20 b
2 00:00 00:20 c
3 00:00 00:10 d
4 00:10 00:15 e
5 00:15 00:20 a
The intervals are inclusive on the left (start) side and exclusive on the right (end) side.
There are always exactly three active subjects at any time. I want to increase the granularity of the data, so that I have the three active subjects for each second. Each second has three unique values.
This would be the desired result for the test case.
slot1 slot2 slot3
0 a c d
1 a c d
2 a c d
3 a c d
4 a c d
5 a c d
6 a c d
7 a c d
8 a c d
9 a c d
10 b c e
11 b c e
12 b c e
13 b c e
14 b c e
15 b c a
16 b c a
17 b c a
18 b c a
19 b c a
The order of the subjects inside the slots is irrelevant for now. Subjects can reappear in the data, like "a" from 00:00 to 00:10 and again from 00:15 to 00:20. Interval boundaries can fall on any second.
Upvotes: 1
Views: 96
Reputation: 2271
This solution uses piso (pandas interval set operations) and will run fast.
setup
Create data and convert to pandas.Timedelta
import pandas as pd
import piso

df = pd.DataFrame(
    {
        "start": ["00:00", "00:10", "00:00", "00:00", "00:10", "00:15"],
        "end": ["00:10", "00:20", "00:20", "00:10", "00:15", "00:20"],
        "name": ["a", "b", "c", "d", "e", "a"],
    }
)
# prefix with hours so the strings parse as HH:MM:SS
df[["start", "end"]] = ("00:" + df[["start", "end"]]).apply(pd.to_timedelta)
Create the sample times (a pandas.TimedeltaIndex of seconds):
sample_times = pd.timedelta_range(df["start"].min(), df["end"].max(), freq="s")
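One thing worth noting (an observation, not part of the original answer): `timedelta_range` includes the right endpoint, so `sample_times` contains 00:20:00 as well, a point which lies outside every left-closed interval. That is why the final row of the boolean output further down is all False; drop the last sample if you only want seconds that fall inside some interval:

```python
import pandas as pd

sample_times = pd.timedelta_range("00:00:00", "00:20:00", freq="s")
print(len(sample_times))  # 1201 points: 1200 seconds plus the 00:20:00 endpoint

# keep only the seconds strictly before the final endpoint
sample_times = sample_times[:-1]
```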
solution
For each possible value of "name" create a pandas.IntervalIndex which has the intervals defined by the start and end columns:
ii_series = df.groupby("name").apply(
    lambda d: pd.IntervalIndex.from_arrays(d["start"], d["end"], closed="left")
)
ii_series looks like this:
name
a IntervalIndex([[0 days 00:00:00, 0 days 00:10:...
b IntervalIndex([[0 days 00:10:00, 0 days 00:20:...
c IntervalIndex([[0 days 00:00:00, 0 days 00:20:...
d IntervalIndex([[0 days 00:00:00, 0 days 00:10:...
e IntervalIndex([[0 days 00:10:00, 0 days 00:15:...
dtype: object
Then to each of these interval indexes we apply the piso.contains function, which can be used to test whether a set of points is contained in an interval:
contained = ii_series.apply(piso.contains, x=sample_times, result="points")
contained will be a dataframe indexed by the names, whose columns are the sample times. Its transpose looks like this:
a b c d e
0 days 00:00:00 True False True True False
0 days 00:00:01 True False True True False
0 days 00:00:02 True False True True False
0 days 00:00:03 True False True True False
0 days 00:00:04 True False True True False
... ... ... ... ... ...
0 days 00:19:56 True True True False False
0 days 00:19:57 True True True False False
0 days 00:19:58 True True True False False
0 days 00:19:59 True True True False False
0 days 00:20:00 False False False False False
This format of data may be easier to work with, depending on the application, but if you want it in the format stated in the question then you can create a series of lists, indexed by each second:
series_of_lists = (
    contained.transpose()
    .melt(ignore_index=False)
    .query("value == True")
    .reset_index()
    .groupby("index")["name"]
    .apply(pd.Series.to_list)
)
Then convert to dataframe:
pd.DataFrame(series_of_lists.to_list(), index=series_of_lists.index)
which will look like this:
0 1 2
index
0 days 00:00:00 a c d
0 days 00:00:01 a c d
0 days 00:00:02 a c d
0 days 00:00:03 a c d
0 days 00:00:04 a c d
... .. .. ..
0 days 00:19:55 a b c
0 days 00:19:56 a b c
0 days 00:19:57 a b c
0 days 00:19:58 a b c
0 days 00:19:59 a b c
Note: I am the creator of piso, feel free to reach out if you have any questions.
Upvotes: 1
Reputation: 150785
Route 1: One (costly but easy) way is to explode the data to seconds, then merge 3 times:
import numpy as np
import pandas as pd

time_df = (('00:' + df[['start','end']])
    .apply(lambda x: pd.to_timedelta(x).dt.total_seconds())
    .astype(int)
    .apply(lambda x: np.arange(*x), axis=1)
    .to_frame('time')
    .assign(slot=df['name'])
    .explode('time')
)
(time_df.merge(time_df, on='time', suffixes=['1','2'])
    .query('slot1 < slot2')
    .merge(time_df, on='time')
    .query('slot2 < slot')
)
Output:
time slot1 slot2 slot
2 0 a c d
11 1 a c d
20 2 a c d
29 3 a c d
38 4 a c d
47 5 a c d
56 6 a c d
65 7 a c d
74 8 a c d
83 9 a c d
92 10 b c e
101 11 b c e
110 12 b c e
119 13 b c e
128 14 b c e
139 15 a b c
148 16 a b c
157 17 a b c
166 18 a b c
175 19 a b c
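The exploded frame also makes it easy to sanity-check the question's invariant of exactly three active subjects per second before merging. A self-contained sketch (rebuilding `df` and `time_df` from the question's data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "start": ["00:00", "00:10", "00:00", "00:00", "00:10", "00:15"],
        "end": ["00:10", "00:20", "00:20", "00:10", "00:15", "00:20"],
        "name": ["a", "b", "c", "d", "e", "a"],
    }
)

# explode each interval row into one row per active second
time_df = (("00:" + df[["start", "end"]])
    .apply(lambda x: pd.to_timedelta(x).dt.total_seconds())
    .astype(int)
    .apply(lambda x: np.arange(*x), axis=1)
    .to_frame("time")
    .assign(slot=df["name"])
    .explode("time")
)

# every second should have exactly three active subjects
counts = time_df.groupby("time").size()
print((counts == 3).all())
```

If this check fails, the triple merge (and the pivot below) would silently drop or misalign rows, so it is worth running on real data.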
Route 2: Another way is to cross merge then query the overlapping intervals:
df[['start','end']] = (('00:' + df[['start','end']])
    .apply(lambda x: pd.to_timedelta(x).dt.total_seconds())
    .astype(int)
)
(df.merge(df, how='cross')
    .assign(start=lambda x: x.filter(like='start').max(axis=1),
            end=lambda x: x.filter(like='end').min(axis=1))
    .query('start < end & name_x < name_y')
    [['name_x','name_y','start','end']]
    .merge(df, how='cross')
    .assign(start=lambda x: x.filter(like='start').max(axis=1),
            end=lambda x: x.filter(like='end').min(axis=1))
    .query('start < end & name_y < name')
    [['start','end', 'name_x','name_y', 'name']]
)
Output:
start end name_x name_y name
3 0 10 a c d
16 10 15 b c e
38 15 20 a b c
As you can see, this output is the same as the other, just in the original interval form. Depending on your data, one route might be better than the other.
Update: Since your data has exactly 3 slots at any time, you can easily do this with pivot. This is the best solution.
# time_df as in Route 1
(time_df.sort_values(['time','slot'])
    .assign(nums=lambda x: np.arange(len(x)) % 3)
    .pivot(index='time', columns='nums', values='slot')
)
# in general, `.assign(nums=lambda x: x.groupby('time').cumcount())`
# also works instead of the above
Output:
nums 0 1 2
time
0 a c d
1 a c d
2 a c d
3 a c d
4 a c d
5 a c d
6 a c d
7 a c d
8 a c d
9 a c d
10 b c e
11 b c e
12 b c e
13 b c e
14 b c e
15 a b c
16 a b c
17 a b c
18 a b c
19 a b c
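Putting the pivot route together end to end, it might look like the sketch below (using keyword arguments for `pivot`, which are required in pandas 2.x, and the more general `cumcount` variant mentioned in the comment above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "start": ["00:00", "00:10", "00:00", "00:00", "00:10", "00:15"],
        "end": ["00:10", "00:20", "00:20", "00:10", "00:15", "00:20"],
        "name": ["a", "b", "c", "d", "e", "a"],
    }
)

# explode each interval into one row per active second (Route 1's time_df)
time_df = (("00:" + df[["start", "end"]])
    .apply(lambda x: pd.to_timedelta(x).dt.total_seconds())
    .astype(int)
    .apply(lambda x: np.arange(*x), axis=1)
    .to_frame("time")
    .assign(slot=df["name"])
    .explode("time")
)

# number the subjects within each second, then pivot to three slot columns
out = (time_df.sort_values(["time", "slot"])
    .assign(nums=lambda x: x.groupby("time").cumcount())
    .pivot(index="time", columns="nums", values="slot")
)
print(out.head())
```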
Upvotes: 1