Rodrigo Fuentes
Rodrigo Fuentes

Reputation: 35

Group DataFrame by entries X amount of time from each other

I need to group together the entries in which the timestamp difference between one and the other is X amount of seconds or less than then average the value for each of them for each of the devices. In the following example I have a Data Frame with this data and I need to group by device with entries between 60 seconds from each other.

              Device            Timestamp  Value
0  30:8c:fb:a4:b9:8b  10/26/2015 22:50:15     34
1  30:8c:fb:a4:b9:8b  10/26/2015 22:50:46     34
2  c0:ee:fb:35:ec:cd  10/26/2015 22:50:50     33
3  c0:ee:fb:35:ec:cd  10/26/2015 22:50:51     32
4  30:8c:fb:a4:b9:8b  10/26/2015 22:51:15     34
5  30:8c:fb:a4:b9:8b  10/26/2015 22:51:47     32
6  c0:ee:fb:35:ec:cd  10/26/2015 22:52:38     38
7  30:8c:fb:a4:b9:8b  10/26/2015 22:54:46     34

This should be the resulting DataFrame

              Device           First_seen            Last_seen Average_value
0  30:8c:fb:a4:b9:8b  10/26/2015 22:50:15  10/26/2015 22:51:47          33,5
1  c0:ee:fb:35:ec:cd  10/26/2015 22:50:50  10/26/2015 22:50:51          32,5
2  c0:ee:fb:35:ec:cd  10/26/2015 22:52:38  10/26/2015 22:52:38            38
3  30:8c:fb:a4:b9:8b  10/26/2015 22:54:46  10/26/2015 22:54:46            34

I have been trying to use timeGrouper but I haven’t been able to get to a working solution. Thank you very much for your help.

Upvotes: 0

Views: 357

Answers (1)

unutbu
unutbu

Reputation: 880707

You could use

diffs = df.groupby(['Device'])['Timestamp'].diff()
# In [39]: diffs
# Out[39]: 
# 0        NaT
# 1   00:00:31
# 2        NaT
# 3   00:00:01
# 4   00:00:29
# 5   00:00:32
# 6   00:01:47
# 7   00:02:59
# dtype: timedelta64[ns]

to calculate the difference between successive timestamps for each device group. Note that this relies on the Timestamps being in sorted order (at least within each Device group). If it isn't you could of course sort the rows by Timestamp first, (e.g. df = df.sort('Timestamp'))

Then create a boolean mask to find when the diff is more than 60 seconds:

df['gap'] = diffs > pd.Timedelta(seconds=60)
# In [42]: df['gap']
# Out[42]: 
# 0    False
# 1    False
# 2    False
# 3    False
# 4    False
# 5    False
# 6     True
# 7     True
# Name: gap, dtype: bool

For each device, we can use cumsum to compute the cumulative sum of df['gap'].

df['group'] = df.groupby(['Device'])['gap'].cumsum()
# In [45]: df['group']
# Out[45]: 
# 0    0
# 1    0
# 2    0
# 3    0
# 4    0
# 5    0
# 6    1
# 7    1
# Name: group, dtype: int64

Since False is treated as 0 and True is treated as 1, the cumulative sum in effect numbers the rows within each device-group which belong to the same gap-group.

Now we can groupby both the Device and group columns and find the first and last Timestamp and the mean Value within each group:

result = df.groupby(['Device', 'group']).agg(
             {'Timestamp': ['first','last'], 'Value':'mean'}):

#                                   Timestamp                     Value
#                                       first                last  mean
# Device            group                                              
# 30:8c:fb:a4:b9:8b 0     2015-10-26 22:50:15 2015-10-26 22:51:47  33.5
#                   1     2015-10-26 22:54:46 2015-10-26 22:54:46  34.0
# c0:ee:fb:35:ec:cd 0     2015-10-26 22:50:50 2015-10-26 22:50:51  32.5
#                   1     2015-10-26 22:52:38 2015-10-26 22:52:38  38.0

Putting it all together:

import pandas as pd

df = pd.DataFrame(
    {'Device': {0: '30:8c:fb:a4:b9:8b',
                1: '30:8c:fb:a4:b9:8b',
                2: 'c0:ee:fb:35:ec:cd',
                3: 'c0:ee:fb:35:ec:cd',
                4: '30:8c:fb:a4:b9:8b',
                5: '30:8c:fb:a4:b9:8b',
                6: 'c0:ee:fb:35:ec:cd',
                7: '30:8c:fb:a4:b9:8b'},
     'Timestamp': {0: pd.Timestamp('2015-10-26 22:50:15'),
                   1: pd.Timestamp('2015-10-26 22:50:46'),
                   2: pd.Timestamp('2015-10-26 22:50:50'),
                   3: pd.Timestamp('2015-10-26 22:50:51'),
                   4: pd.Timestamp('2015-10-26 22:51:15'),
                   5: pd.Timestamp('2015-10-26 22:51:47'),
                   6: pd.Timestamp('2015-10-26 22:52:38'),
                   7: pd.Timestamp('2015-10-26 22:54:46')},
     'Value': {0: 34, 1: 34, 2: 33, 3: 32, 4: 34, 5: 32, 6: 38, 7: 34}})

diffs = df.groupby(['Device'])['Timestamp'].diff()
df['gap'] = diffs > pd.Timedelta(seconds=60)
df['group'] = df.groupby(['Device'])['gap'].cumsum()
result = df.groupby(['Device', 'group']).agg({'Timestamp': ['first','last'], 'Value':'mean'})
print(result)

Upvotes: 1

Related Questions