Reputation: 1330
I'm new to both Python and pandas, and after trying out a few approaches, I was hoping to elicit some suggestions from everyone on the best way to structure this dataset, given the goals of my analysis.
Given the following DataFrame:
id event timestamp
1 "page 1 load" 1/1/2014 0:00:01
1 "page 1 exit" 1/1/2014 0:00:31
2 "page 2 load" 1/1/2014 0:01:01
2 "page 2 exit" 1/1/2014 0:01:31
3 "page 3 load" 1/1/2014 0:02:01
3 "page 3 exit" 1/1/2014 0:02:31
4 "page 1 load" 2/1/2014 1:00:01
4 "page 1 exit" 2/1/2014 1:00:31
5 "page 2 load" 2/1/2014 1:01:01
5 "page 2 exit" 2/1/2014 1:01:31
6 "page 3 load" 2/1/2014 1:02:01
6 "page 3 exit" 2/1/2014 1:02:31
The goal is to calculate the time elapsed from a load to an exit. However, I first need to validate that the load and exit timestamps are indeed from the same session (id) before computing the elapsed time. The approach I am thinking of is to process the source dataset and create a new DataFrame in which each row is a pair of already-validated events, with an added elapsed column to make computation and grouping easier, like this:
id event_1 timestamp_1 event_2 timestamp_2 elapsed
1 "page 1 load" 1/1/2014 0:00:01 "page 1 exit" 1/1/2014 0:00:31 0:00:30
2 "page 2 load" 1/1/2014 0:01:01 "page 2 exit" 1/1/2014 0:01:31 0:00:30
3 "page 3 load" 1/1/2014 0:02:01 "page 3 exit" 1/1/2014 0:02:31 0:00:30
Is this a good approach? If so, what are the best methods to create this new DataFrame?
Upvotes: 3
Views: 1993
Reputation: 375415
I would consider breaking event into separate columns, e.g. using str.extract:
In [11]: df['event'].str.extract('page (?P<page>\d+) (?P<event>\w+)')
Out[11]:
page event
0 1 load
1 1 exit
2 2 load
3 2 exit
4 3 load
5 3 exit
6 1 load
7 1 exit
8 2 load
9 2 exit
10 3 load
11 3 exit
[12 rows x 2 columns]
If you assign this result back to the DataFrame's columns:
In [12]: df[['page', 'event']] = df['event'].str.extract('page (?P<page>\d+) (?P<event>\w+)')
In [13]: df
Out[13]:
id event timestamp page
0 1 load 2014-01-01 00:00:01 1
1 1 exit 2014-01-01 00:00:31 1
2 2 load 2014-01-01 00:01:01 2
3 2 exit 2014-01-01 00:01:31 2
4 3 load 2014-01-01 00:02:01 3
5 3 exit 2014-01-01 00:02:31 3
6 4 load 2014-02-01 01:00:01 1
7 4 exit 2014-02-01 01:00:31 1
8 5 load 2014-02-01 01:01:01 2
9 5 exit 2014-02-01 01:01:31 2
10 6 load 2014-02-01 01:02:01 3
11 6 exit 2014-02-01 01:02:31 3
[12 rows x 4 columns]
I think this is now much easier to reason about...
Then you can groupby the session id and the page:
In [14]: g = df.groupby(['id', 'page']) # perhaps use as_index=False
One way to get the desired result is to use an apply e.g. via the following function:
def get_load_and_exit(x):
    # return a Series indexed by event ('load'/'exit') with the timestamps as values
    assert len(x) == 2  # each (id, page) group should be exactly one load/exit pair
    return pd.Series(x['timestamp'].values, x['event'])
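Before applying this, it can also be worth checking up front that every (id, page) group really does contain exactly two rows; a small sketch, not part of the original answer:

sizes = g.size()            # number of rows per (id, page) group
bad = sizes[sizes != 2]     # any group without exactly one load/exit pair
if not bad.empty:
    print(bad)              # inspect or drop these sessions before the apply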
In [16]: res = g.apply(get_load_and_exit)
In [17]: res
Out[17]:
event load exit
id page
1 1 2014-01-01 00:00:01 2014-01-01 00:00:31
2 2 2014-01-01 00:01:01 2014-01-01 00:01:31
3 3 2014-01-01 00:02:01 2014-01-01 00:02:31
4 1 2014-02-01 01:00:01 2014-02-01 01:00:31
5 2 2014-02-01 01:01:01 2014-02-01 01:01:31
6 3 2014-02-01 01:02:01 2014-02-01 01:02:31
[6 rows x 2 columns]
Assuming these are Timestamps*, you can simply subtract the columns:
In [18]: res['duration'] = res['exit'] - res['load']
In [19]: res
Out[19]:
event load exit duration
id page
1 1 2014-01-01 00:00:01 2014-01-01 00:00:31 00:00:30
2 2 2014-01-01 00:01:01 2014-01-01 00:01:31 00:00:30
3 3 2014-01-01 00:02:01 2014-01-01 00:02:31 00:00:30
4 1 2014-02-01 01:00:01 2014-02-01 01:00:31 00:00:30
5 2 2014-02-01 01:01:01 2014-02-01 01:01:31 00:00:30
6 3 2014-02-01 01:02:01 2014-02-01 01:02:31 00:00:30
[6 rows x 3 columns]
*It's worth ensuring the timestamp column is actually a datetime column (at the moment it's not!):
df['timestamp'] = pd.to_datetime(df['timestamp'])
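Putting it all together, a condensed sketch of the whole pipeline (this assumes pd.to_datetime's default month-first parsing, which matches the output above; pass dayfirst=True if the source dates are day-first):

import pandas as pd

df['timestamp'] = pd.to_datetime(df['timestamp'])

# split "page N load/exit" into a page number and an event type
df[['page', 'event']] = df['event'].str.extract(r'page (?P<page>\d+) (?P<event>\w+)')

def get_load_and_exit(x):
    assert len(x) == 2  # each (id, page) group should be exactly one load/exit pair
    return pd.Series(x['timestamp'].values, x['event'])

res = df.groupby(['id', 'page']).apply(get_load_and_exit)
res['duration'] = res['exit'] - res['load']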
Upvotes: 4