Reputation: 35
I need to group together the entries in which the timestamp difference between one and the other is X amount of seconds or less than then average the value for each of them for each of the devices. In the following example I have a Data Frame with this data and I need to group by device with entries between 60 seconds from each other.
Device Timestamp Value
0 30:8c:fb:a4:b9:8b 10/26/2015 22:50:15 34
1 30:8c:fb:a4:b9:8b 10/26/2015 22:50:46 34
2 c0:ee:fb:35:ec:cd 10/26/2015 22:50:50 33
3 c0:ee:fb:35:ec:cd 10/26/2015 22:50:51 32
4 30:8c:fb:a4:b9:8b 10/26/2015 22:51:15 34
5 30:8c:fb:a4:b9:8b 10/26/2015 22:51:47 32
6 c0:ee:fb:35:ec:cd 10/26/2015 22:52:38 38
7 30:8c:fb:a4:b9:8b 10/26/2015 22:54:46 34
This should be the resulting DataFrame
Device First_seen Last_seen Average_value
0 30:8c:fb:a4:b9:8b 10/26/2015 22:50:15 10/26/2015 22:51:47 33,5
1 c0:ee:fb:35:ec:cd 10/26/2015 22:50:50 10/26/2015 22:50:51 32,5
2 c0:ee:fb:35:ec:cd 10/26/2015 22:52:38 10/26/2015 22:52:38 38
3 30:8c:fb:a4:b9:8b 10/26/2015 22:54:46 10/26/2015 22:54:46 34
I have been trying to use timeGrouper but I haven’t been able to get to a working solution. Thank you very much for your help.
Upvotes: 0
Views: 357
Reputation: 880707
You could use
diffs = df.groupby(['Device'])['Timestamp'].diff()
# In [39]: diffs
# Out[39]:
# 0 NaT
# 1 00:00:31
# 2 NaT
# 3 00:00:01
# 4 00:00:29
# 5 00:00:32
# 6 00:01:47
# 7 00:02:59
# dtype: timedelta64[ns]
to calculate the difference between successive timestamps for each device group.
Note that this relies on the Timestamps being in sorted order (at least within each Device
group). If it isn't you could of course sort the rows by Timestamp
first, (e.g. df = df.sort('Timestamp')
)
Then create a boolean mask to find when the diff is more than 60 seconds:
df['gap'] = diffs > pd.Timedelta(seconds=60)
# In [42]: df['gap']
# Out[42]:
# 0 False
# 1 False
# 2 False
# 3 False
# 4 False
# 5 False
# 6 True
# 7 True
# Name: gap, dtype: bool
For each device, we can use cumsum
to compute the cumulative sum of df['gap']
.
df['group'] = df.groupby(['Device'])['gap'].cumsum()
# In [45]: df['group']
# Out[45]:
# 0 0
# 1 0
# 2 0
# 3 0
# 4 0
# 5 0
# 6 1
# 7 1
# Name: group, dtype: int64
Since False is treated as 0 and True is treated as 1, the cumulative sum in effect numbers the rows within each device-group which belong to the same gap-group.
Now we can groupby both the Device
and group
columns and find the first and last Timestamp
and the mean Value
within each group:
result = df.groupby(['Device', 'group']).agg(
{'Timestamp': ['first','last'], 'Value':'mean'}):
# Timestamp Value
# first last mean
# Device group
# 30:8c:fb:a4:b9:8b 0 2015-10-26 22:50:15 2015-10-26 22:51:47 33.5
# 1 2015-10-26 22:54:46 2015-10-26 22:54:46 34.0
# c0:ee:fb:35:ec:cd 0 2015-10-26 22:50:50 2015-10-26 22:50:51 32.5
# 1 2015-10-26 22:52:38 2015-10-26 22:52:38 38.0
Putting it all together:
import pandas as pd
df = pd.DataFrame(
{'Device': {0: '30:8c:fb:a4:b9:8b',
1: '30:8c:fb:a4:b9:8b',
2: 'c0:ee:fb:35:ec:cd',
3: 'c0:ee:fb:35:ec:cd',
4: '30:8c:fb:a4:b9:8b',
5: '30:8c:fb:a4:b9:8b',
6: 'c0:ee:fb:35:ec:cd',
7: '30:8c:fb:a4:b9:8b'},
'Timestamp': {0: pd.Timestamp('2015-10-26 22:50:15'),
1: pd.Timestamp('2015-10-26 22:50:46'),
2: pd.Timestamp('2015-10-26 22:50:50'),
3: pd.Timestamp('2015-10-26 22:50:51'),
4: pd.Timestamp('2015-10-26 22:51:15'),
5: pd.Timestamp('2015-10-26 22:51:47'),
6: pd.Timestamp('2015-10-26 22:52:38'),
7: pd.Timestamp('2015-10-26 22:54:46')},
'Value': {0: 34, 1: 34, 2: 33, 3: 32, 4: 34, 5: 32, 6: 38, 7: 34}})
diffs = df.groupby(['Device'])['Timestamp'].diff()
df['gap'] = diffs > pd.Timedelta(seconds=60)
df['group'] = df.groupby(['Device'])['gap'].cumsum()
result = df.groupby(['Device', 'group']).agg({'Timestamp': ['first','last'], 'Value':'mean'})
print(result)
Upvotes: 1