Tom Johnson
Tom Johnson

Reputation: 2154

Pandas groupby a timedelta greater than N minutes

I have a dataframe with columns for the timetag, a satellite ID, and a site ID. My goal is to break up the data set into individual "tracks" where each "track" is a unique combination of satellite and site IDs. I can do this easily using the standard pandas groupby functionality and specifying by=['site', 'sat']. But a further caveat is that if there is a time gap of more than N minutes within a group, then the data after the time gap should become a new "track".

My question is what's the best way to calculate the timedelta between successive rows within my (site, sat) group, identify when the timedelta is greater than N minutes, and create a new group/track?

I think I can calculate the time delta between rows using the diff() method. Ideally there would be a way to add a third key to my groupby call that encapsulates the time constraint I'm using.

Here is some sample code to generate a test data set and do the initial site, sat grouping.

import pandas as pd
import numpy as np

# Create first sample set.

N=10
A_times = pd.date_range('2016-01-01T00:00:00', periods=N, freq='1s')
A_data = np.arange(0, N)
A_site = ['X'] * N
A_sat = 12345

# Create second sample set over the same time span but with a different sat

N=5
B_times = pd.date_range('2016-01-01T00:00:00', periods=N, freq='1s')
B_data = np.arange(0, N)
B_site = ['X'] * N
B_sat = 3456


# Create a third sample set with a new site over the same time span but the
# same sat as the second set

N = 10
C_times = pd.date_range('2016-01-01T00:01:00', periods=N, freq='1s')
C_data = np.arange(0, N)
C_site = ['Y'] * N
C_sat = 3456

# Create a fourth sample set with the same sat and site as the third set but
# more than 20 minutes after the third set.

N = 5
D_times = pd.date_range('2016-01-01T01:00:00', periods=N, freq='1s')
D_data = np.arange(0, N)
D_site = ['Y'] * N
D_sat = 3456

# Build a data frame for each sample set

A = pd.DataFrame(index=A_times, data={'data': A_data, 'site' : A_site, 'sat' : A_sat})
B = pd.DataFrame(index=B_times, data={'data': B_data, 'site' : B_site, 'sat' : B_sat})
C = pd.DataFrame(index=C_times, data={'data': C_data, 'site' : C_site, 'sat' : C_sat})
D = pd.DataFrame(index=D_times, data={'data': D_data, 'site' : D_site, 'sat' : D_sat})

# mash them into one larger test data frame

test = pd.concat([A, B, C, D])

print(test)

my_groups = test.groupby(by = ['site', 'sat'])

for key, g in my_groups:
    print(key)
    print(g)

The output of this is

test =

                     data    sat site
2016-01-01 00:00:00     0  12345    X
2016-01-01 00:00:01     1  12345    X
2016-01-01 00:00:02     2  12345    X
2016-01-01 00:00:03     3  12345    X
2016-01-01 00:00:04     4  12345    X
2016-01-01 00:00:05     5  12345    X
2016-01-01 00:00:06     6  12345    X
2016-01-01 00:00:07     7  12345    X
2016-01-01 00:00:08     8  12345    X
2016-01-01 00:00:09     9  12345    X
2016-01-01 00:00:00     0   3456    X
2016-01-01 00:00:01     1   3456    X
2016-01-01 00:00:02     2   3456    X
2016-01-01 00:00:03     3   3456    X
2016-01-01 00:00:04     4   3456    X
2016-01-01 00:01:00     0   3456    Y
2016-01-01 00:01:01     1   3456    Y
2016-01-01 00:01:02     2   3456    Y
2016-01-01 00:01:03     3   3456    Y
2016-01-01 00:01:04     4   3456    Y
2016-01-01 00:01:05     5   3456    Y
2016-01-01 00:01:06     6   3456    Y
2016-01-01 00:01:07     7   3456    Y
2016-01-01 00:01:08     8   3456    Y
2016-01-01 00:01:09     9   3456    Y
2016-01-01 01:00:00     0   3456    Y
2016-01-01 01:00:01     1   3456    Y
2016-01-01 01:00:02     2   3456    Y
2016-01-01 01:00:03     3   3456    Y
2016-01-01 01:00:04     4   3456    Y

and the various groups are

('X', 3456)
                     data   sat site
2016-01-01 00:00:00     0  3456    X
2016-01-01 00:00:01     1  3456    X
2016-01-01 00:00:02     2  3456    X
2016-01-01 00:00:03     3  3456    X
2016-01-01 00:00:04     4  3456    X
('X', 12345)
                     data    sat site
2016-01-01 00:00:00     0  12345    X
2016-01-01 00:00:01     1  12345    X
2016-01-01 00:00:02     2  12345    X
2016-01-01 00:00:03     3  12345    X
2016-01-01 00:00:04     4  12345    X
2016-01-01 00:00:05     5  12345    X
2016-01-01 00:00:06     6  12345    X
2016-01-01 00:00:07     7  12345    X
2016-01-01 00:00:08     8  12345    X
2016-01-01 00:00:09     9  12345    X
('Y', 3456)
                     data   sat site
2016-01-01 00:01:00     0  3456    Y
2016-01-01 00:01:01     1  3456    Y
2016-01-01 00:01:02     2  3456    Y
2016-01-01 00:01:03     3  3456    Y
2016-01-01 00:01:04     4  3456    Y
2016-01-01 00:01:05     5  3456    Y
2016-01-01 00:01:06     6  3456    Y
2016-01-01 00:01:07     7  3456    Y
2016-01-01 00:01:08     8  3456    Y
2016-01-01 00:01:09     9  3456    Y
2016-01-01 01:00:00     0  3456    Y
2016-01-01 01:00:01     1  3456    Y
2016-01-01 01:00:02     2  3456    Y
2016-01-01 01:00:03     3  3456    Y
2016-01-01 01:00:04     4  3456    Y

The desired behavior is that the third group above should really be split into two groups because of the 20 minute gap in the data, e.g.

('Y', 3456)
                     data   sat site
2016-01-01 00:01:00     0  3456    Y
2016-01-01 00:01:01     1  3456    Y
2016-01-01 00:01:02     2  3456    Y
2016-01-01 00:01:03     3  3456    Y
2016-01-01 00:01:04     4  3456    Y
2016-01-01 00:01:05     5  3456    Y
2016-01-01 00:01:06     6  3456    Y
2016-01-01 00:01:07     7  3456    Y
2016-01-01 00:01:08     8  3456    Y
2016-01-01 00:01:09     9  3456    Y

New group here

2016-01-01 01:00:00     0  3456    Y
2016-01-01 01:00:01     1  3456    Y
2016-01-01 01:00:02     2  3456    Y
2016-01-01 01:00:03     3  3456    Y
2016-01-01 01:00:04     4  3456    Y

Any suggestions would be appreciated. Thanks!

Upvotes: 2

Views: 2440

Answers (1)

Tom Johnson
Tom Johnson

Reputation: 2154

After fiddling around with unutbu's solution and combining it with the suggestions from this post I was able to solve the problem. A more complete example test set and solution(s) are shown below.

# In[222]:

import pandas as pd
import numpy as np


# In[223]:

# Create 1st sample set.

N=10
A_times = pd.date_range('2016-01-01T00:00:00', periods=N, freq='1s')
A_data = np.arange(0, N)
A_site = ['X'] * N
A_sat = 12345

# Create 2nd sample set over the same time span but with a different sat

N=5
B_times = pd.date_range('2016-01-01T00:00:00', periods=N, freq='1s')
B_data = np.arange(0, N)
B_site = ['X'] * N
B_sat = 3456


# Create a 3rd sample set with a new site over the same time span but the
# same sat as the 2nd set

N = 10
C_times = pd.date_range('2016-01-01T00:01:00', periods=N, freq='1s')
C_data = np.arange(0, N)
C_site = ['Y'] * N
C_sat = 3456

# Create a 4th sample set with the same sat and site as the 3rd set but
# more than 20 minutes after the third set.

N = 5
D_times = pd.date_range('2016-01-01T01:00:00', periods=N, freq='1s')
D_data = np.arange(0, N)
D_site = ['Y'] * N
D_sat = 3456

# Create a 5th sample set with the same sat and site as the 3rd set but
# more than 20 minutes after the third set.

N = 60
E_times = pd.date_range('2016-01-01T00:00:00', periods=N, freq='60s')
E_data = np.arange(0, N)
E_site = ['Z'] * N
E_sat = 3456

# Create a 6th sample set with the same sat and site as the 4th set but
# more than 20 minutes after the fourth set.

N = 5
F_times = pd.date_range('2016-01-02T00:00:00', periods=N, freq='60s')
F_data = np.arange(0, N)
F_site = ['Y'] * N
F_sat = 3456


# In[224]:

# Build a data frame for each sample set

A = pd.DataFrame(data={'time': A_times, 'data': A_data, 'site' : A_site, 'sat' : A_sat})
B = pd.DataFrame(data={'time': B_times, 'data': B_data, 'site' : B_site, 'sat' : B_sat})
C = pd.DataFrame(data={'time': C_times, 'data': C_data, 'site' : C_site, 'sat' : C_sat})
D = pd.DataFrame(data={'time': D_times, 'data': D_data, 'site' : D_site, 'sat' : D_sat})
E = pd.DataFrame(data={'time': E_times, 'data': E_data, 'site' : E_site, 'sat' : E_sat})
F = pd.DataFrame(data={'time': F_times, 'data': F_data, 'site' : F_site, 'sat' : F_sat})

# mash them into one larger test data frame

test = pd.concat([A, B, C, D, E, F])


# In[225]:

print(test)


# In[226]:

test.sort_values(['time'], inplace=True)


# In[227]:

# This approach doesn't quite work. Note that the group (Y, 3456, 0) really 
# has 2 tracks in it because the overlapping track from (Z, 3456, 0) is screwing up
# the delta-t calculation and hiding the fact that within the Y, 3456 group there
# was a large time gap.

test1 = test.copy()
test1['delta_t'] = test1['time'].diff()
test1['track'] = (test1['delta_t'] > pd.Timedelta(minutes=20)).cumsum()

my_groups = test1.groupby(by = ['site', 'sat', 'track'])
for key, g in my_groups:
    print(key)
    print(g)

This has output that looks like this:

('X', 3456, 0)
   data   sat site                time  delta_t  track
0     0  3456    X 2016-01-01 00:00:00   0 days      0
1     1  3456    X 2016-01-01 00:00:01   0 days      0
2     2  3456    X 2016-01-01 00:00:02   0 days      0
3     3  3456    X 2016-01-01 00:00:03   0 days      0
4     4  3456    X 2016-01-01 00:00:04   0 days      0
('X', 12345, 0)
   data    sat site                time  delta_t  track
0     0  12345    X 2016-01-01 00:00:00      NaT      0
1     1  12345    X 2016-01-01 00:00:01 00:00:01      0
2     2  12345    X 2016-01-01 00:00:02 00:00:01      0
3     3  12345    X 2016-01-01 00:00:03 00:00:01      0
4     4  12345    X 2016-01-01 00:00:04 00:00:01      0
5     5  12345    X 2016-01-01 00:00:05 00:00:01      0
6     6  12345    X 2016-01-01 00:00:06 00:00:01      0
7     7  12345    X 2016-01-01 00:00:07 00:00:01      0
8     8  12345    X 2016-01-01 00:00:08 00:00:01      0
9     9  12345    X 2016-01-01 00:00:09 00:00:01      0
('Y', 3456, 0)
   data   sat site                time  delta_t  track
0     0  3456    Y 2016-01-01 00:01:00 00:00:00      0
1     1  3456    Y 2016-01-01 00:01:01 00:00:01      0
2     2  3456    Y 2016-01-01 00:01:02 00:00:01      0
3     3  3456    Y 2016-01-01 00:01:03 00:00:01      0
4     4  3456    Y 2016-01-01 00:01:04 00:00:01      0
5     5  3456    Y 2016-01-01 00:01:05 00:00:01      0
6     6  3456    Y 2016-01-01 00:01:06 00:00:01      0
7     7  3456    Y 2016-01-01 00:01:07 00:00:01      0
8     8  3456    Y 2016-01-01 00:01:08 00:00:01      0
9     9  3456    Y 2016-01-01 00:01:09 00:00:01      0
0     0  3456    Y 2016-01-01 01:00:00 00:01:00      0
1     1  3456    Y 2016-01-01 01:00:01 00:00:01      0
2     2  3456    Y 2016-01-01 01:00:02 00:00:01      0
3     3  3456    Y 2016-01-01 01:00:03 00:00:01      0
4     4  3456    Y 2016-01-01 01:00:04 00:00:01      0
('Y', 3456, 1)
   data   sat site                time  delta_t  track
0     0  3456    Y 2016-01-02 00:00:00 22:59:56      1
1     1  3456    Y 2016-01-02 00:01:00 00:01:00      1
2     2  3456    Y 2016-01-02 00:02:00 00:01:00      1
3     3  3456    Y 2016-01-02 00:03:00 00:01:00      1
4     4  3456    Y 2016-01-02 00:04:00 00:01:00      1
('Z', 3456, 0)
    data   sat site                time  delta_t  track
0      0  3456    Z 2016-01-01 00:00:00 00:00:00      0
1      1  3456    Z 2016-01-01 00:01:00 00:00:51      0
2      2  3456    Z 2016-01-01 00:02:00 00:00:51      0
3      3  3456    Z 2016-01-01 00:03:00 00:01:00      0
4      4  3456    Z 2016-01-01 00:04:00 00:01:00      0
5      5  3456    Z 2016-01-01 00:05:00 00:01:00      0
6      6  3456    Z 2016-01-01 00:06:00 00:01:00      0
7      7  3456    Z 2016-01-01 00:07:00 00:01:00      0
8      8  3456    Z 2016-01-01 00:08:00 00:01:00      0
9      9  3456    Z 2016-01-01 00:09:00 00:01:00      0
10    10  3456    Z 2016-01-01 00:10:00 00:01:00      0
11    11  3456    Z 2016-01-01 00:11:00 00:01:00      0
12    12  3456    Z 2016-01-01 00:12:00 00:01:00      0
13    13  3456    Z 2016-01-01 00:13:00 00:01:00      0
14    14  3456    Z 2016-01-01 00:14:00 00:01:00      0
15    15  3456    Z 2016-01-01 00:15:00 00:01:00      0
16    16  3456    Z 2016-01-01 00:16:00 00:01:00      0
17    17  3456    Z 2016-01-01 00:17:00 00:01:00      0
18    18  3456    Z 2016-01-01 00:18:00 00:01:00      0
19    19  3456    Z 2016-01-01 00:19:00 00:01:00      0
20    20  3456    Z 2016-01-01 00:20:00 00:01:00      0
21    21  3456    Z 2016-01-01 00:21:00 00:01:00      0
22    22  3456    Z 2016-01-01 00:22:00 00:01:00      0
23    23  3456    Z 2016-01-01 00:23:00 00:01:00      0
24    24  3456    Z 2016-01-01 00:24:00 00:01:00      0
25    25  3456    Z 2016-01-01 00:25:00 00:01:00      0
26    26  3456    Z 2016-01-01 00:26:00 00:01:00      0
27    27  3456    Z 2016-01-01 00:27:00 00:01:00      0
28    28  3456    Z 2016-01-01 00:28:00 00:01:00      0
29    29  3456    Z 2016-01-01 00:29:00 00:01:00      0
30    30  3456    Z 2016-01-01 00:30:00 00:01:00      0
31    31  3456    Z 2016-01-01 00:31:00 00:01:00      0
32    32  3456    Z 2016-01-01 00:32:00 00:01:00      0
33    33  3456    Z 2016-01-01 00:33:00 00:01:00      0
34    34  3456    Z 2016-01-01 00:34:00 00:01:00      0
35    35  3456    Z 2016-01-01 00:35:00 00:01:00      0
36    36  3456    Z 2016-01-01 00:36:00 00:01:00      0
37    37  3456    Z 2016-01-01 00:37:00 00:01:00      0
38    38  3456    Z 2016-01-01 00:38:00 00:01:00      0
39    39  3456    Z 2016-01-01 00:39:00 00:01:00      0
40    40  3456    Z 2016-01-01 00:40:00 00:01:00      0
41    41  3456    Z 2016-01-01 00:41:00 00:01:00      0
42    42  3456    Z 2016-01-01 00:42:00 00:01:00      0
43    43  3456    Z 2016-01-01 00:43:00 00:01:00      0
44    44  3456    Z 2016-01-01 00:44:00 00:01:00      0
45    45  3456    Z 2016-01-01 00:45:00 00:01:00      0
46    46  3456    Z 2016-01-01 00:46:00 00:01:00      0
47    47  3456    Z 2016-01-01 00:47:00 00:01:00      0
48    48  3456    Z 2016-01-01 00:48:00 00:01:00      0
49    49  3456    Z 2016-01-01 00:49:00 00:01:00      0
50    50  3456    Z 2016-01-01 00:50:00 00:01:00      0
51    51  3456    Z 2016-01-01 00:51:00 00:01:00      0
52    52  3456    Z 2016-01-01 00:52:00 00:01:00      0
53    53  3456    Z 2016-01-01 00:53:00 00:01:00      0
54    54  3456    Z 2016-01-01 00:54:00 00:01:00      0
55    55  3456    Z 2016-01-01 00:55:00 00:01:00      0
56    56  3456    Z 2016-01-01 00:56:00 00:01:00      0
57    57  3456    Z 2016-01-01 00:57:00 00:01:00      0
58    58  3456    Z 2016-01-01 00:58:00 00:01:00      0
59    59  3456    Z 2016-01-01 00:59:00 00:01:00      0

Note that the group for ('Y', 3456, 0) really has two tracks in it. So not quite a complete solution. Continuing on I tried this

# In[228]:

# This method works. The difference is that when I calculated the 
# delta_t I did it on the results of the groupby.

# There's a undesireable effect that the track counter doesn't reset with each new
# (site, sat) pair. It appears to keep counting up.

test2 = test.copy()
test2.sort_values(['site', 'sat', 'time'], inplace=True)
test2['delta_t'] = test2.groupby(['site', 'sat'])['time'].diff()       
test2['track'] = (test2['delta_t'] > pd.Timedelta(minutes=20)).cumsum()

my_groups = test2.groupby(by = ['site', 'sat', 'track'])
for key, g in my_groups:
    print(key)
    print(g)

which has output

('X', 3456, 0)
   data   sat site                time  delta_t  track
0     0  3456    X 2016-01-01 00:00:00      NaT      0
1     1  3456    X 2016-01-01 00:00:01 00:00:01      0
2     2  3456    X 2016-01-01 00:00:02 00:00:01      0
3     3  3456    X 2016-01-01 00:00:03 00:00:01      0
4     4  3456    X 2016-01-01 00:00:04 00:00:01      0
('X', 12345, 0)
   data    sat site                time  delta_t  track
0     0  12345    X 2016-01-01 00:00:00      NaT      0
1     1  12345    X 2016-01-01 00:00:01 00:00:01      0
2     2  12345    X 2016-01-01 00:00:02 00:00:01      0
3     3  12345    X 2016-01-01 00:00:03 00:00:01      0
4     4  12345    X 2016-01-01 00:00:04 00:00:01      0
5     5  12345    X 2016-01-01 00:00:05 00:00:01      0
6     6  12345    X 2016-01-01 00:00:06 00:00:01      0
7     7  12345    X 2016-01-01 00:00:07 00:00:01      0
8     8  12345    X 2016-01-01 00:00:08 00:00:01      0
9     9  12345    X 2016-01-01 00:00:09 00:00:01      0
('Y', 3456, 0)
   data   sat site                time  delta_t  track
0     0  3456    Y 2016-01-01 00:01:00      NaT      0
1     1  3456    Y 2016-01-01 00:01:01 00:00:01      0
2     2  3456    Y 2016-01-01 00:01:02 00:00:01      0
3     3  3456    Y 2016-01-01 00:01:03 00:00:01      0
4     4  3456    Y 2016-01-01 00:01:04 00:00:01      0
5     5  3456    Y 2016-01-01 00:01:05 00:00:01      0
6     6  3456    Y 2016-01-01 00:01:06 00:00:01      0
7     7  3456    Y 2016-01-01 00:01:07 00:00:01      0
8     8  3456    Y 2016-01-01 00:01:08 00:00:01      0
9     9  3456    Y 2016-01-01 00:01:09 00:00:01      0
('Y', 3456, 1)
   data   sat site                time  delta_t  track
0     0  3456    Y 2016-01-01 01:00:00 00:58:51      1
1     1  3456    Y 2016-01-01 01:00:01 00:00:01      1
2     2  3456    Y 2016-01-01 01:00:02 00:00:01      1
3     3  3456    Y 2016-01-01 01:00:03 00:00:01      1
4     4  3456    Y 2016-01-01 01:00:04 00:00:01      1
('Y', 3456, 2)
   data   sat site                time  delta_t  track
0     0  3456    Y 2016-01-02 00:00:00 22:59:56      2
1     1  3456    Y 2016-01-02 00:01:00 00:01:00      2
2     2  3456    Y 2016-01-02 00:02:00 00:01:00      2
3     3  3456    Y 2016-01-02 00:03:00 00:01:00      2
4     4  3456    Y 2016-01-02 00:04:00 00:01:00      2
('Z', 3456, 2)
    data   sat site                time  delta_t  track
0      0  3456    Z 2016-01-01 00:00:00      NaT      2
1      1  3456    Z 2016-01-01 00:01:00 00:01:00      2
2      2  3456    Z 2016-01-01 00:02:00 00:01:00      2
3      3  3456    Z 2016-01-01 00:03:00 00:01:00      2
4      4  3456    Z 2016-01-01 00:04:00 00:01:00      2
5      5  3456    Z 2016-01-01 00:05:00 00:01:00      2
6      6  3456    Z 2016-01-01 00:06:00 00:01:00      2
7      7  3456    Z 2016-01-01 00:07:00 00:01:00      2
8      8  3456    Z 2016-01-01 00:08:00 00:01:00      2
9      9  3456    Z 2016-01-01 00:09:00 00:01:00      2
10    10  3456    Z 2016-01-01 00:10:00 00:01:00      2
11    11  3456    Z 2016-01-01 00:11:00 00:01:00      2
12    12  3456    Z 2016-01-01 00:12:00 00:01:00      2
13    13  3456    Z 2016-01-01 00:13:00 00:01:00      2
14    14  3456    Z 2016-01-01 00:14:00 00:01:00      2
15    15  3456    Z 2016-01-01 00:15:00 00:01:00      2
16    16  3456    Z 2016-01-01 00:16:00 00:01:00      2
17    17  3456    Z 2016-01-01 00:17:00 00:01:00      2
18    18  3456    Z 2016-01-01 00:18:00 00:01:00      2
19    19  3456    Z 2016-01-01 00:19:00 00:01:00      2
20    20  3456    Z 2016-01-01 00:20:00 00:01:00      2
21    21  3456    Z 2016-01-01 00:21:00 00:01:00      2
22    22  3456    Z 2016-01-01 00:22:00 00:01:00      2
23    23  3456    Z 2016-01-01 00:23:00 00:01:00      2
24    24  3456    Z 2016-01-01 00:24:00 00:01:00      2
25    25  3456    Z 2016-01-01 00:25:00 00:01:00      2
26    26  3456    Z 2016-01-01 00:26:00 00:01:00      2
27    27  3456    Z 2016-01-01 00:27:00 00:01:00      2
28    28  3456    Z 2016-01-01 00:28:00 00:01:00      2
29    29  3456    Z 2016-01-01 00:29:00 00:01:00      2
30    30  3456    Z 2016-01-01 00:30:00 00:01:00      2
31    31  3456    Z 2016-01-01 00:31:00 00:01:00      2
32    32  3456    Z 2016-01-01 00:32:00 00:01:00      2
33    33  3456    Z 2016-01-01 00:33:00 00:01:00      2
34    34  3456    Z 2016-01-01 00:34:00 00:01:00      2
35    35  3456    Z 2016-01-01 00:35:00 00:01:00      2
36    36  3456    Z 2016-01-01 00:36:00 00:01:00      2
37    37  3456    Z 2016-01-01 00:37:00 00:01:00      2
38    38  3456    Z 2016-01-01 00:38:00 00:01:00      2
39    39  3456    Z 2016-01-01 00:39:00 00:01:00      2
40    40  3456    Z 2016-01-01 00:40:00 00:01:00      2
41    41  3456    Z 2016-01-01 00:41:00 00:01:00      2
42    42  3456    Z 2016-01-01 00:42:00 00:01:00      2
43    43  3456    Z 2016-01-01 00:43:00 00:01:00      2
44    44  3456    Z 2016-01-01 00:44:00 00:01:00      2
45    45  3456    Z 2016-01-01 00:45:00 00:01:00      2
46    46  3456    Z 2016-01-01 00:46:00 00:01:00      2
47    47  3456    Z 2016-01-01 00:47:00 00:01:00      2
48    48  3456    Z 2016-01-01 00:48:00 00:01:00      2
49    49  3456    Z 2016-01-01 00:49:00 00:01:00      2
50    50  3456    Z 2016-01-01 00:50:00 00:01:00      2
51    51  3456    Z 2016-01-01 00:51:00 00:01:00      2
52    52  3456    Z 2016-01-01 00:52:00 00:01:00      2
53    53  3456    Z 2016-01-01 00:53:00 00:01:00      2
54    54  3456    Z 2016-01-01 00:54:00 00:01:00      2
55    55  3456    Z 2016-01-01 00:55:00 00:01:00      2
56    56  3456    Z 2016-01-01 00:56:00 00:01:00      2
57    57  3456    Z 2016-01-01 00:57:00 00:01:00      2
58    58  3456    Z 2016-01-01 00:58:00 00:01:00      2
59    59  3456    Z 2016-01-01 00:59:00 00:01:00      2

The track breakdown is correct, but it would be nice if the track counter reset to 0 with each new (site, sat) pair.

# In[229]:

# This method works. The difference is that when I calculate the
# "track" counter I'm doing the cumulative sum on the results of 
# the groupby. It also resets the track counter with each new 
# site, sat group.

test3 = test.copy()

test3.sort_values(['site', 'sat', 'time'], inplace=True)
test3['delta_t'] = test3.groupby(['site', 'sat'])['time'].diff()    

# calculate an intermediate flag column. If you try to eliminate this
# and put the boolean test directly into the 'track' calculation pandas
# will throw a recursion error.

test3['new_track'] = test3['delta_t'] > pd.Timedelta(minutes=20)

# The to_numeric call is used to convert from a float to an integer.

test3['track'] = pd.to_numeric(test3.groupby(['site', 'sat'])['new_track'].cumsum(), downcast='integer')

my_groups = test3.groupby(by = ['site', 'sat', 'track'])
for key, g in my_groups:
    print(key)
    print(g)

which as the following output

('X', 3456, 0)
   data   sat site                time  delta_t new_track  track
0     0  3456    X 2016-01-01 00:00:00      NaT     False      0
1     1  3456    X 2016-01-01 00:00:01 00:00:01     False      0
2     2  3456    X 2016-01-01 00:00:02 00:00:01     False      0
3     3  3456    X 2016-01-01 00:00:03 00:00:01     False      0
4     4  3456    X 2016-01-01 00:00:04 00:00:01     False      0
('X', 12345, 0)
   data    sat site                time  delta_t new_track  track
0     0  12345    X 2016-01-01 00:00:00      NaT     False      0
1     1  12345    X 2016-01-01 00:00:01 00:00:01     False      0
2     2  12345    X 2016-01-01 00:00:02 00:00:01     False      0
3     3  12345    X 2016-01-01 00:00:03 00:00:01     False      0
4     4  12345    X 2016-01-01 00:00:04 00:00:01     False      0
5     5  12345    X 2016-01-01 00:00:05 00:00:01     False      0
6     6  12345    X 2016-01-01 00:00:06 00:00:01     False      0
7     7  12345    X 2016-01-01 00:00:07 00:00:01     False      0
8     8  12345    X 2016-01-01 00:00:08 00:00:01     False      0
9     9  12345    X 2016-01-01 00:00:09 00:00:01     False      0
('Y', 3456, 0)
   data   sat site                time  delta_t new_track  track
0     0  3456    Y 2016-01-01 00:01:00      NaT     False      0
1     1  3456    Y 2016-01-01 00:01:01 00:00:01     False      0
2     2  3456    Y 2016-01-01 00:01:02 00:00:01     False      0
3     3  3456    Y 2016-01-01 00:01:03 00:00:01     False      0
4     4  3456    Y 2016-01-01 00:01:04 00:00:01     False      0
5     5  3456    Y 2016-01-01 00:01:05 00:00:01     False      0
6     6  3456    Y 2016-01-01 00:01:06 00:00:01     False      0
7     7  3456    Y 2016-01-01 00:01:07 00:00:01     False      0
8     8  3456    Y 2016-01-01 00:01:08 00:00:01     False      0
9     9  3456    Y 2016-01-01 00:01:09 00:00:01     False      0
('Y', 3456, 1)
   data   sat site                time  delta_t new_track  track
0     0  3456    Y 2016-01-01 01:00:00 00:58:51      True      1
1     1  3456    Y 2016-01-01 01:00:01 00:00:01     False      1
2     2  3456    Y 2016-01-01 01:00:02 00:00:01     False      1
3     3  3456    Y 2016-01-01 01:00:03 00:00:01     False      1
4     4  3456    Y 2016-01-01 01:00:04 00:00:01     False      1
('Y', 3456, 2)
   data   sat site                time  delta_t new_track  track
0     0  3456    Y 2016-01-02 00:00:00 22:59:56      True      2
1     1  3456    Y 2016-01-02 00:01:00 00:01:00     False      2
2     2  3456    Y 2016-01-02 00:02:00 00:01:00     False      2
3     3  3456    Y 2016-01-02 00:03:00 00:01:00     False      2
4     4  3456    Y 2016-01-02 00:04:00 00:01:00     False      2
('Z', 3456, 0)
    data   sat site                time  delta_t new_track  track
0      0  3456    Z 2016-01-01 00:00:00      NaT     False      0
1      1  3456    Z 2016-01-01 00:01:00 00:01:00     False      0
2      2  3456    Z 2016-01-01 00:02:00 00:01:00     False      0
3      3  3456    Z 2016-01-01 00:03:00 00:01:00     False      0
4      4  3456    Z 2016-01-01 00:04:00 00:01:00     False      0
5      5  3456    Z 2016-01-01 00:05:00 00:01:00     False      0
6      6  3456    Z 2016-01-01 00:06:00 00:01:00     False      0
7      7  3456    Z 2016-01-01 00:07:00 00:01:00     False      0
8      8  3456    Z 2016-01-01 00:08:00 00:01:00     False      0
9      9  3456    Z 2016-01-01 00:09:00 00:01:00     False      0
10    10  3456    Z 2016-01-01 00:10:00 00:01:00     False      0
11    11  3456    Z 2016-01-01 00:11:00 00:01:00     False      0
12    12  3456    Z 2016-01-01 00:12:00 00:01:00     False      0
13    13  3456    Z 2016-01-01 00:13:00 00:01:00     False      0
14    14  3456    Z 2016-01-01 00:14:00 00:01:00     False      0
15    15  3456    Z 2016-01-01 00:15:00 00:01:00     False      0
16    16  3456    Z 2016-01-01 00:16:00 00:01:00     False      0
17    17  3456    Z 2016-01-01 00:17:00 00:01:00     False      0
18    18  3456    Z 2016-01-01 00:18:00 00:01:00     False      0
19    19  3456    Z 2016-01-01 00:19:00 00:01:00     False      0
20    20  3456    Z 2016-01-01 00:20:00 00:01:00     False      0
21    21  3456    Z 2016-01-01 00:21:00 00:01:00     False      0
22    22  3456    Z 2016-01-01 00:22:00 00:01:00     False      0
23    23  3456    Z 2016-01-01 00:23:00 00:01:00     False      0
24    24  3456    Z 2016-01-01 00:24:00 00:01:00     False      0
25    25  3456    Z 2016-01-01 00:25:00 00:01:00     False      0
26    26  3456    Z 2016-01-01 00:26:00 00:01:00     False      0
27    27  3456    Z 2016-01-01 00:27:00 00:01:00     False      0
28    28  3456    Z 2016-01-01 00:28:00 00:01:00     False      0
29    29  3456    Z 2016-01-01 00:29:00 00:01:00     False      0
30    30  3456    Z 2016-01-01 00:30:00 00:01:00     False      0
31    31  3456    Z 2016-01-01 00:31:00 00:01:00     False      0
32    32  3456    Z 2016-01-01 00:32:00 00:01:00     False      0
33    33  3456    Z 2016-01-01 00:33:00 00:01:00     False      0
34    34  3456    Z 2016-01-01 00:34:00 00:01:00     False      0
35    35  3456    Z 2016-01-01 00:35:00 00:01:00     False      0
36    36  3456    Z 2016-01-01 00:36:00 00:01:00     False      0
37    37  3456    Z 2016-01-01 00:37:00 00:01:00     False      0
38    38  3456    Z 2016-01-01 00:38:00 00:01:00     False      0
39    39  3456    Z 2016-01-01 00:39:00 00:01:00     False      0
40    40  3456    Z 2016-01-01 00:40:00 00:01:00     False      0
41    41  3456    Z 2016-01-01 00:41:00 00:01:00     False      0
42    42  3456    Z 2016-01-01 00:42:00 00:01:00     False      0
43    43  3456    Z 2016-01-01 00:43:00 00:01:00     False      0
44    44  3456    Z 2016-01-01 00:44:00 00:01:00     False      0
45    45  3456    Z 2016-01-01 00:45:00 00:01:00     False      0
46    46  3456    Z 2016-01-01 00:46:00 00:01:00     False      0
47    47  3456    Z 2016-01-01 00:47:00 00:01:00     False      0
48    48  3456    Z 2016-01-01 00:48:00 00:01:00     False      0
49    49  3456    Z 2016-01-01 00:49:00 00:01:00     False      0
50    50  3456    Z 2016-01-01 00:50:00 00:01:00     False      0
51    51  3456    Z 2016-01-01 00:51:00 00:01:00     False      0
52    52  3456    Z 2016-01-01 00:52:00 00:01:00     False      0
53    53  3456    Z 2016-01-01 00:53:00 00:01:00     False      0
54    54  3456    Z 2016-01-01 00:54:00 00:01:00     False      0
55    55  3456    Z 2016-01-01 00:55:00 00:01:00     False      0
56    56  3456    Z 2016-01-01 00:56:00 00:01:00     False      0
57    57  3456    Z 2016-01-01 00:57:00 00:01:00     False      0
58    58  3456    Z 2016-01-01 00:58:00 00:01:00     False      0
59    59  3456    Z 2016-01-01 00:59:00 00:01:00     False      0

So I think I have a good solution at this point. It's odd that pandas throws a recursion error if you try to combine the new_track test into the track calculation.

Upvotes: 2

Related Questions