Reputation: 141
I have an IoT device that has been running for the last 18 months and a lot of data which I need to analyse. The device has been switched on and off at various times, and I want to work out when it was switched on using timestamps in the following format, where each sample was taken at one-minute intervals:
08-01-01 10:00
08-01-01 10:01
08-01-01 10:02
08-01-02 03:10
08-01-02 03:11
Ideally I want to generate a report in the following format:
Time session 1 - 08-01-01 10:00 08-01-01 10:02 Session 1 ran for three minutes
Time session 2 - 08-01-02 03:10 08-01-02 03:11 Session 2 ran for two minutes
The problem is I have over 150k timestamps and cannot think of a quick way to sort the data. Currently I build another array containing the complete minute-by-minute timestamp range from the very first timestamp to the last, then compare the original timestamp array against that master array and set a marker. It works but is not very efficient, and I am trying to think of a better way to analyse this data.
import csv
from datetime import datetime, timedelta

with open('HomeOfficeApr.csv', 'r', newline='') as csvfile:
    readCSV = csv.reader(csvfile, delimiter=',')
    orgtimestamp = []
    for row in readCSV:
        ts = row[0]
        orgtimestamp.append(ts)

# Trim each timestamp down to 'YYYY-MM-DD HH:MM'
for elements in range(len(orgtimestamp)):
    orgtimestamp[elements] = orgtimestamp[elements][:-9]

print("First time stamp")
print(orgtimestamp[0])
print("Create time stamp range")

def datetime_range(start, end, delta):
    current = start
    if not isinstance(delta, timedelta):
        delta = timedelta(**delta)
    while current < end:
        yield current
        current += delta

# Timestamps hard coded - need to change to first and last timestamp
start = datetime(2017, 4, 13, 8, 30)
end = datetime(2018, 12, 31, 12, 0)

gentimestamp = []
for dt in datetime_range(start, end, {'days': 0, 'minutes': 1}):
    gentimestamp.append(str(dt))
# Drop the trailing ':SS' so both arrays share the same format
for i in range(len(gentimestamp)):
    gentimestamp[i] = gentimestamp[i][:-3]

print("Compare time stamp")
print(len(gentimestamp))

CompareTimeStampArray = ["Y"] * len(gentimestamp)

for i in range(len(orgtimestamp)):
    for y in range(len(gentimestamp)):
        if orgtimestamp[i][0:4] == gentimestamp[y][0:4]:                      # year
            if orgtimestamp[i][5:7] == gentimestamp[y][5:7]:                  # month
                if orgtimestamp[i][8:10] == gentimestamp[y][8:10]:            # day
                    if orgtimestamp[i][11:13] == gentimestamp[y][11:13]:      # hour
                        if orgtimestamp[i][14:16] == gentimestamp[y][14:16]:  # minute
                            print("Date & time match")
                            print(orgtimestamp[i])
                            print(gentimestamp[y])
                            print(i)
                            print(y)
                            print("")
                            CompareTimeStampArray[i] = "X"
                            break
print("Finished")
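As a side note, the nested loop above rescans the full master range for every sample, which is O(n·m); the same marking can be done with one set lookup per master minute. A minimal, self-contained sketch using hypothetical sample data in place of the CSV column:

```python
from datetime import datetime, timedelta

# Hypothetical sample data standing in for the CSV timestamps
orgtimestamp = ['08-01-01 10:00', '08-01-01 10:01', '08-01-01 10:02',
                '08-01-02 03:10', '08-01-02 03:11']

# Derive the master range from the first and last sample (no hard-coding)
start = datetime.strptime(orgtimestamp[0], '%y-%m-%d %H:%M')
end = datetime.strptime(orgtimestamp[-1], '%y-%m-%d %H:%M')

# One O(1) set lookup per master minute replaces the nested loop
seen = set(orgtimestamp)
markers = []
current = start
while current <= end:
    markers.append('X' if current.strftime('%y-%m-%d %H:%M') in seen else 'Y')
    current += timedelta(minutes=1)
```

Building the set once and probing it per minute brings the comparison down to a single pass over the master range.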
Upvotes: 0
Views: 899
Reputation: 993
The pandas library could be of help here. It lets you load CSV files into a spreadsheet-like format in which you can perform column operations. It can also deal quite well with time formats. Try this:
Edit: now accounting for new input format
import pandas as pd
import numpy as np
# make up some timestamps in ascending order
stamps = ['08-01-01 10:00', '08-01-01 10:01', '08-01-01 10:02',
'08-01-02 03:10', '08-01-02 03:11', '08-02-15 13:34',
'08-03-06 09:06', '08-03-06 09:07', '08-03-06 09:08', ]
# get original timestamps into a pandas dataframe
ts = pd.DataFrame(stamps, columns=['orig_timestamp'])
# assuming that the timestamps are in year-month-day hour:minute format
ts['Timestamp'] = pd.to_datetime(ts['orig_timestamp'], format='%y-%m-%d %H:%M')
# get the timedelta between consecutive rows, set to 0 for first row
ts['Timedelta'] = ts['Timestamp'].diff().fillna(value=0)
# get the timedelta in minutes
ts['minute_delta'] = ts['Timedelta'].astype('timedelta64[m]')
# set to True whenever a new Session begins, i.e. timedelta is not one minute
ts['newSession'] = np.where(ts['minute_delta'] == 1, False, True)
# cumulative sum of session starts
ts['SessionID'] = ts['newSession'].cumsum()
# group timestamps by SessionID and count their duration
grouped_timestamps = ts[['orig_timestamp', 'SessionID']].\
    groupby(['SessionID']).agg(['first', 'last', 'count'])
print(ts[['orig_timestamp', 'minute_delta', 'newSession', 'SessionID']])
print(grouped_timestamps)
The final DataFrame looks like this:
orig_timestamp minute_delta newSession SessionID
0 08-01-01 10:00 0.0 True 1
1 08-01-01 10:01 1.0 False 1
2 08-01-01 10:02 1.0 False 1
3 08-01-02 03:10 1028.0 True 2
4 08-01-02 03:11 1.0 False 2
5 08-02-15 13:34 63983.0 True 3
6 08-03-06 09:06 28532.0 True 4
7 08-03-06 09:07 1.0 False 4
8 08-03-06 09:08 1.0 False 4
Grouping produces a DataFrame with column 'count' as the number of minutes that each session ran:
orig_timestamp
first last count
SessionID
1 08-01-01 10:00 08-01-01 10:02 3
2 08-01-02 03:10 08-01-02 03:11 2
3 08-02-15 13:34 08-02-15 13:34 1
4 08-03-06 09:06 08-03-06 09:08 3
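To turn a grouped result like this into the report format from the question, one can iterate over its rows. A minimal, self-contained sketch (using the first two sessions' sample data and a direct `groupby`; the exact report wording is an assumption):

```python
import pandas as pd

stamps = ['08-01-01 10:00', '08-01-01 10:01', '08-01-01 10:02',
          '08-01-02 03:10', '08-01-02 03:11']
ts = pd.DataFrame(stamps, columns=['orig_timestamp'])
ts['Timestamp'] = pd.to_datetime(ts['orig_timestamp'], format='%y-%m-%d %H:%M')

# Gap in minutes to the previous row; 0 for the first row
ts['minute_delta'] = ts['Timestamp'].diff().dt.total_seconds().div(60).fillna(0)
# A new session starts wherever the gap is not exactly one minute
ts['SessionID'] = (ts['minute_delta'] != 1).cumsum()

grouped = ts.groupby('SessionID')['orig_timestamp'].agg(['first', 'last', 'count'])
for session_id, row in grouped.iterrows():
    print(f"Time session {session_id} - {row['first']} {row['last']} "
          f"Session {session_id} ran for {row['count']} minutes")
```

For the sample above this prints one line per session, e.g. "Time session 1 - 08-01-01 10:00 08-01-01 10:02 Session 1 ran for 3 minutes".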
Upvotes: 2