mbadd

Reputation: 967

Pandas groupby using time window from select rows

I have some time-series data where the ids of rows with type 'RX' or 'CRC' have unfortunately wrapped around. The result is that several 'RX'/'CRC' rows end up sharing an id (2 and 3 in the example below) when they should carry the id of the corresponding 'TX' row. The 'TX' ids are always correct.

What I want to do is group rows into a 10ms time window starting at each 'TX' row, and fix the id column based on the id value in that 'TX' row, i.e.

From:

                       time type  id
 0  2020-01-01 10:33:00.000   TX   0
 1  2020-01-01 10:34:00.500   TX   1
 2  2020-01-01 10:34:00.000   TX   2
 3  2020-01-01 10:34:00.007  CRC   2
 4  2020-01-01 10:34:00.009   RX   2
 5  2020-01-01 10:34:00.027   RX   2
 6  2020-01-01 10:34:00.047   RX   2
 7  2020-01-01 10:34:00.012   TX   3
 8  2020-01-01 10:34:00.013   RX   3
 9  2020-01-01 10:34:00.038   RX   3
 10 2020-01-01 10:34:00.026   TX   4
 11 2020-01-01 10:34:00.036   TX   5
 12 2020-01-01 10:34:00.041   TX   6

To:

                       time type  id
 0  2020-01-01 10:33:00.000   TX   0
 1  2020-01-01 10:34:00.500   TX   1
 2  2020-01-01 10:34:00.000   TX   2
 3  2020-01-01 10:34:00.007  CRC   2
 4  2020-01-01 10:34:00.009   RX   2
 5  2020-01-01 10:34:00.012   TX   3
 6  2020-01-01 10:34:00.013   RX   3
 7  2020-01-01 10:34:00.026   TX   4
 8  2020-01-01 10:34:00.027   RX   4
 9  2020-01-01 10:34:00.036   TX   5
 10 2020-01-01 10:34:00.038   RX   5
 11 2020-01-01 10:34:00.041   TX   6
 12 2020-01-01 10:34:00.047   RX   6

MWE:

import pandas as pd

s1 = pd.to_datetime(['20200101 10:33:00.000',
                     '20200101 10:34:00.500',
                     '20200101 10:34:00.000', '20200101 10:34:00.007', '20200101 10:34:00.009', '20200101 10:34:00.027', '20200101 10:34:00.047',
                     '20200101 10:34:00.012', '20200101 10:34:00.013', '20200101 10:34:00.038',
                     '20200101 10:34:00.026',
                     '20200101 10:34:00.036',
                     '20200101 10:34:00.041'])
d1 = {'time': s1,
      'type': ['TX', 'TX', 'TX', 'CRC', 'RX', 'RX', 'RX', 'TX', 'RX', 'RX', 'TX', 'TX', 'TX'],
      'id': [0, 1, 2, 2, 2, 2, 2, 3, 3, 3, 4, 5, 6]}
df1 = pd.DataFrame(data=d1)
print(df1)

s2 = pd.to_datetime(['20200101 10:33:00.000',
                     '20200101 10:34:00.500',
                     '20200101 10:34:00.000', '20200101 10:34:00.007', '20200101 10:34:00.009',
                     '20200101 10:34:00.012', '20200101 10:34:00.013',
                     '20200101 10:34:00.026', '20200101 10:34:00.027',
                     '20200101 10:34:00.036', '20200101 10:34:00.038',
                     '20200101 10:34:00.041', '20200101 10:34:00.047'])
d2 = {'time': s2,
      'type': ['TX', 'TX', 'TX', 'CRC', 'RX', 'TX', 'RX', 'TX', 'RX', 'TX', 'RX', 'TX', 'RX'],
      'id': [0, 1, 2, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6]}
df2 = pd.DataFrame(data=d2)
print(df2)
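
Any solution applied to df1 should reproduce df2 (up to row order). For reference, this is the check I'd run against a proposed fix (fix() below is just a placeholder for whatever the solution turns out to be; it assumes the result is re-sorted by time with the index reset):

candidate = fix(df1)  # placeholder: whatever ends up repairing the id column
candidate = candidate.sort_values('time').reset_index(drop=True)
expected = df2.sort_values('time').reset_index(drop=True)
pd.testing.assert_frame_equal(candidate, expected, check_dtype=False)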

Upvotes: 1

Views: 63

Answers (1)

ASGM

Reputation: 11391

This looks like it would solve your problem:

df1 = df1.sort_values('time')
# each 'TX' row starts a new group: counting the TX rows seen so far (minus 1) gives the group id
df1['id'] = (df1['type'] == 'TX').cumsum() - 1

Output:

                      time type  id
0  2020-01-01 10:34:00.000   TX   0
1  2020-01-01 10:34:00.007  CRC   0
2  2020-01-01 10:34:00.009   RX   0
5  2020-01-01 10:34:00.012   TX   1
6  2020-01-01 10:34:00.013   RX   1
8  2020-01-01 10:34:00.026   TX   2
3  2020-01-01 10:34:00.027   RX   2
9  2020-01-01 10:34:00.036   TX   3
7  2020-01-01 10:34:00.038   RX   3
10 2020-01-01 10:34:00.041   TX   4
4  2020-01-01 10:34:00.047   RX   4
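
Note that this re-numbers the groups from 0 in time order. If you need to keep the original 'TX' id values (in your desired output the 'TX' row at 10:34:00.500 keeps id 1) and also enforce the 10ms window explicitly, a rough sketch along these lines should work, starting again from the original df1 (before the re-assignment above). It is untested beyond your example; any RX/CRC row with no 'TX' in the preceding 10ms ends up with a NaN id:

tx = df1[df1['type'] == 'TX'].sort_values('time')
other = df1[df1['type'] != 'TX'].sort_values('time')

# match each RX/CRC row to the nearest earlier TX row, at most 10ms back
other = pd.merge_asof(other, tx[['time', 'id']], on='time',
                      direction='backward',
                      tolerance=pd.Timedelta('10ms'),
                      suffixes=('_old', ''))

fixed = (pd.concat([tx, other.drop(columns='id_old')])
           .sort_values('time')
           .reset_index(drop=True))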

Upvotes: 1
