ADITYA AWALKAR
ADITYA AWALKAR

Reputation: 117

Timestamp String to Seconds Conversion in Python

I'm a beginner in Python Data Science. I'm working on clickstream data and want to find out the duration of a session. For that I find the start time and end time of the session. However on subtraction, I'm getting wrong answer for the same. Here is the data

       Sid                    Tstamp     Itemid  Category
0        1  2014-04-07T10:51:09.277Z  214536502         0
1        1  2014-04-07T10:54:09.868Z  214536500         0
2        1  2014-04-07T10:54:46.998Z  214536506         0
3        1  2014-04-07T10:57:00.306Z  214577561         0
4        2  2014-04-07T13:56:37.614Z  214662742         0
5        2  2014-04-07T13:57:19.373Z  214662742         0
6        2  2014-04-07T13:58:37.446Z  214825110         0
7        2  2014-04-07T13:59:50.710Z  214757390         0
8        2  2014-04-07T14:00:38.247Z  214757407         0
9        2  2014-04-07T14:02:36.889Z  214551617         0
10       3  2014-04-02T13:17:46.940Z  214716935         0
11       3  2014-04-02T13:26:02.515Z  214774687         0
12       3  2014-04-02T13:30:12.318Z  214832672         0

I referred this question for the code- Timestamp Conversion

Here is my code-

k.columns=['Sid','Tstamp','Itemid','Category']
k=k.loc[:,('Sid','Tstamp')]

#Find max timestamp
idx=k.groupby(['Sid'])['Tstamp'].transform(max) == k['Tstamp'] 
ah=k[idx].reset_index()

#Find min timestamp
idy=k.groupby(['Sid'])['Tstamp'].transform(min) == k['Tstamp']
ai=k[idy].reset_index()

#grouping by Sid and applying count to retain the distinct Sid values
kgrp=k.groupby('Sid').count()

i=0
for temp1,temp2 in zip(ah['Tstamp'],ai['Tstamp']):
    sv1= datetime.datetime.strptime(temp1, "%Y-%m-%dT%H:%M:%S.%fZ")
    sv2= datetime.datetime.strptime(temp2, "%Y-%m-%dT%H:%M:%S.%fZ")
    d1=time.mktime(sv1.timetuple()) + (sv1.microsecond / 1000000.0)
    d2=time.mktime(sv2.timetuple()) + (sv2.microsecond / 1000000.0)
    kgrp.loc[i,'duration']= d1-d2   
    i=i+1

Here is the output.

 
kgrp
Out[5]: 
      Tstamp  duration
Sid                   
1          4   359.275
2          6   745.378
3          3  1034.468

For session id 2, the duration should be close to 6 minutes however I'm getting almost 12 minutes. I reckon I'm making some silly mistake here.

Also, I'm grouping by Sid and applying count on it so as to get the Sid column and store each duration as a separate column. Is there any easier method through which I can store only the Sid (not the 'Tstamp' Count Column) and its duration values?

Upvotes: 2

Views: 250

Answers (1)

Shaung
Shaung

Reputation: 508

You are assigning the duration value to the wrong label. In your test data sid starts from 1 but i starts from 0:

# for sid 1, i == 0
kgrp.loc[i,'duration']= d1-d2
i=i+1

Update

A more pythonic way to handle this :)

def calculate_duration(dt1, dt2):
    # do the calculation here, return the duration in seconds

k = k.loc[:, ('Sid', 'Tstamp')]
result = k.groupby(['Sid'])['Tstamp'].agg({
    'Duration': lambda x: calculate_duration(x.max(), x.min()),
    'Count': lambda x: x.count()
})

Upvotes: 1

Related Questions