Reputation: 1330
I'm new to both Python and pandas, and after trying out a few approaches, I was hoping to elicit some suggestions from everyone on the best way to structure this dataset, given the goals of my analysis.
Given the following DataFrame:
id event timestamp
1 "page 1 load" 1/1/2014 0:00:01
1 "page 1 exit" 1/1/2014 0:00:31
2 "page 2 load" 1/1/2014 0:01:01
2 "page 2 exit" 1/1/2014 0:01:31
3 "page 3 load" 1/1/2014 0:02:01
3 "page 3 exit" 1/1/2014 0:02:31
4 "page 1 load" 2/1/2014 1:00:01
4 "page 1 exit" 2/1/2014 1:00:31
5 "page 2 load" 2/1/2014 1:01:01
5 "page 2 exit" 2/1/2014 1:01:31
6 "page 3 load" 2/1/2014 1:02:01
6 "page 3 exit" 2/1/2014 1:02:31
The goal is to calculate the time elapsed from a load to an exit. However, I first need to validate that the load and exit timestamps are indeed from the same session (id) before computing the elapsed time. The approach I am thinking of is to process the source dataset and create a new DataFrame in which each row is a pair of already-validated events, with an added elapsed column to make computation and grouping easier, like this:
id event_1 timestamp_1 event_2 timestamp_2 elapsed
1 "page 1 load" 1/1/2014 0:00:01 "page 1 exit" 1/1/2014 0:00:31 0:00:30
2 "page 2 load" 1/1/2014 0:01:01 "page 2 exit" 1/1/2014 0:01:31 0:00:30
3 "page 3 load" 1/1/2014 0:02:01 "page 3 exit" 1/1/2014 0:02:31 0:00:30
Is this a good approach? If so, what are the best methods to create this new DataFrame?
Upvotes: 3
Views: 1993
Reputation: 375415
I would consider breaking event into separate columns, e.g. using str.extract:
In [11]: df['event'].str.extract('page (?P<page>\d+) (?P<event>\w+)')
Out[11]:
page event
0 1 load
1 1 exit
2 2 load
3 2 exit
4 3 load
5 3 exit
6 1 load
7 1 exit
8 2 load
9 2 exit
10 3 load
11 3 exit
[12 rows x 2 columns]
If you assign this result back to the DataFrame's columns:
In [12]: df[['page', 'event']] = df['event'].str.extract('page (?P<page>\d+) (?P<event>\w+)')
In [13]: df
Out[13]:
id event timestamp page
0 1 load 2014-01-01 00:00:01 1
1 1 exit 2014-01-01 00:00:31 1
2 2 load 2014-01-01 00:01:01 2
3 2 exit 2014-01-01 00:01:31 2
4 3 load 2014-01-01 00:02:01 3
5 3 exit 2014-01-01 00:02:31 3
6 4 load 2014-02-01 01:00:01 1
7 4 exit 2014-02-01 01:00:31 1
8 5 load 2014-02-01 01:01:01 2
9 5 exit 2014-02-01 01:01:31 2
10 6 load 2014-02-01 01:02:01 3
11 6 exit 2014-02-01 01:02:31 3
[12 rows x 4 columns]
I think this is now much easier to reason about...
Then you can groupby the session id and the page:
In [14]: g = df.groupby(['id', 'page']) # perhaps use as_index=False
One way to get the desired result is to use an apply e.g. via the following function:
def get_load_and_exit(x):
    # return a Series indexed by event ('load'/'exit') with the timestamps as values
    assert len(x) == 2  # each (id, page) group should be exactly one load/exit pair
    return pd.Series(x['timestamp'].values, x['event'])
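Before applying this, it can also be worth checking up front that every (id, page) group really does contain exactly two rows; a small sketch, not part of the original answer:

sizes = g.size()            # number of rows per (id, page) group
bad = sizes[sizes != 2]     # any group without exactly one load/exit pair
if not bad.empty:
    print(bad)              # inspect or drop these sessions before the apply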
In [16]: res = g.apply(get_load_and_exit)
In [17]: res
Out[17]:
event load exit
id page
1 1 2014-01-01 00:00:01 2014-01-01 00:00:31
2 2 2014-01-01 00:01:01 2014-01-01 00:01:31
3 3 2014-01-01 00:02:01 2014-01-01 00:02:31
4 1 2014-02-01 01:00:01 2014-02-01 01:00:31
5 2 2014-02-01 01:01:01 2014-02-01 01:01:31
6 3 2014-02-01 01:02:01 2014-02-01 01:02:31
[6 rows x 2 columns]
Assuming these are Timestamps*, you can simply subtract the columns:
In [18]: res['duration'] = res['exit'] - res['load']
In [19]: res
Out[19]:
event load exit duration
id page
1 1 2014-01-01 00:00:01 2014-01-01 00:00:31 00:00:30
2 2 2014-01-01 00:01:01 2014-01-01 00:01:31 00:00:30
3 3 2014-01-01 00:02:01 2014-01-01 00:02:31 00:00:30
4 1 2014-02-01 01:00:01 2014-02-01 01:00:31 00:00:30
5 2 2014-02-01 01:01:01 2014-02-01 01:01:31 00:00:30
6 3 2014-02-01 01:02:01 2014-02-01 01:02:31 00:00:30
[6 rows x 3 columns]
*It's worth ensuring the timestamp column is actually a datetime column (at the moment it's not!):
df['timestamp'] = pd.to_datetime(df['timestamp'])
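Putting it all together, a condensed sketch of the whole pipeline (this assumes pd.to_datetime's default month-first parsing, which matches the output above; pass dayfirst=True if the source dates are day-first):

import pandas as pd

df['timestamp'] = pd.to_datetime(df['timestamp'])

# split "page N load/exit" into a page number and an event type
df[['page', 'event']] = df['event'].str.extract(r'page (?P<page>\d+) (?P<event>\w+)')

def get_load_and_exit(x):
    assert len(x) == 2  # each (id, page) group should be exactly one load/exit pair
    return pd.Series(x['timestamp'].values, x['event'])

res = df.groupby(['id', 'page']).apply(get_load_and_exit)
res['duration'] = res['exit'] - res['load']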
Upvotes: 4