Reputation: 117
I'm a beginner in Python Data Science. I'm working on clickstream data and want to find out the duration of a session. For that I find the start time and end time of the session. However on subtraction, I'm getting wrong answer for the same. Here is the data
Sid Tstamp Itemid Category 0 1 2014-04-07T10:51:09.277Z 214536502 0 1 1 2014-04-07T10:54:09.868Z 214536500 0 2 1 2014-04-07T10:54:46.998Z 214536506 0 3 1 2014-04-07T10:57:00.306Z 214577561 0 4 2 2014-04-07T13:56:37.614Z 214662742 0 5 2 2014-04-07T13:57:19.373Z 214662742 0 6 2 2014-04-07T13:58:37.446Z 214825110 0 7 2 2014-04-07T13:59:50.710Z 214757390 0 8 2 2014-04-07T14:00:38.247Z 214757407 0 9 2 2014-04-07T14:02:36.889Z 214551617 0 10 3 2014-04-02T13:17:46.940Z 214716935 0 11 3 2014-04-02T13:26:02.515Z 214774687 0 12 3 2014-04-02T13:30:12.318Z 214832672 0
I referred this question for the code- Timestamp Conversion
Here is my code-
k.columns=['Sid','Tstamp','Itemid','Category']
k=k.loc[:,('Sid','Tstamp')]
#Find max timestamp
idx=k.groupby(['Sid'])['Tstamp'].transform(max) == k['Tstamp']
ah=k[idx].reset_index()
#Find min timestamp
idy=k.groupby(['Sid'])['Tstamp'].transform(min) == k['Tstamp']
ai=k[idy].reset_index()
#grouping by Sid and applying count to retain the distinct Sid values
kgrp=k.groupby('Sid').count()
i=0
for temp1,temp2 in zip(ah['Tstamp'],ai['Tstamp']):
sv1= datetime.datetime.strptime(temp1, "%Y-%m-%dT%H:%M:%S.%fZ")
sv2= datetime.datetime.strptime(temp2, "%Y-%m-%dT%H:%M:%S.%fZ")
d1=time.mktime(sv1.timetuple()) + (sv1.microsecond / 1000000.0)
d2=time.mktime(sv2.timetuple()) + (sv2.microsecond / 1000000.0)
kgrp.loc[i,'duration']= d1-d2
i=i+1
Here is the output.
kgrp Out[5]: Tstamp duration Sid 1 4 359.275 2 6 745.378 3 3 1034.468
For session id 2, the duration should be close to 6 minutes however I'm getting almost 12 minutes. I reckon I'm making some silly mistake here.
Also, I'm grouping by Sid and applying count on it so as to get the Sid column and store each duration as a separate column. Is there any easier method through which I can store only the Sid (not the 'Tstamp' Count Column) and its duration values?
Upvotes: 2
Views: 250
Reputation: 508
You are assigning the duration value to the wrong label.
In your test data sid starts from 1 but i
starts from 0:
# for sid 1, i == 0
kgrp.loc[i,'duration']= d1-d2
i=i+1
Update
A more pythonic way to handle this :)
def calculate_duration(dt1, dt2):
# do the calculation here, return the duration in seconds
k = k.loc[:, ('Sid', 'Tstamp')]
result = k.groupby(['Sid'])['Tstamp'].agg({
'Duration': lambda x: calculate_duration(x.max(), x.min()),
'Count': lambda x: x.count()
})
Upvotes: 1