Aggregating Column Values Into A Sequence Using Pandas

Question

I have a timeseries dataset which looks a bit like

dt         userid var1 var2
2013-04-27 100001   a   50
2013-04-27 100002   a   45
2013-04-28 100001   b   10
2013-04-29 100001   c   20
2013-04-29 100002   b   15

I'd like to extract the event sequence for each user - into a new DataFrame similar to

userid activity_seq
100001    [a, b, c]
100002       [a, b]

the order of the var1 in the activity_seq column follows the dt column (i.e. in chronological order). Could someone please suggest how this can be achieved using pandas?

Thanks in advance.

Here's a snippet to recreate the original DataFrame,

import pandas as pd

l1 = ['2013-04-27', '2013-04-27', '2013-04-28', '2013-04-29', '2013-04-29']
l2 = [100001, 100002, 100001, 100001, 100002]
l3 = ['a','a','b','c','b']
l4 = [50, 45, 10, 20, 15]
df = pd.DataFrame({'dt':l1, 'userid':l2, 'var1':l3, 'var2':l4})

EdChum · Accepted Answer

You can groupby on 'userid' and then on 'var1' col call apply and pass list to create a sequence, you can rename/reset if required.

In [58]:
df.groupby('userid')['var1'].apply(list)

Out[58]:
userid
100001    [a, b, c]
100002       [a, b]
Name: var1, dtype: object

To get desired df, call reset_index and rename:

In [62]:
df.groupby('userid')['var1'].apply(list).reset_index().rename(columns={'var1':'activity_seq'})

Out[62]:
   userid activity_seq
0  100001    [a, b, c]
1  100002       [a, b]

Aggregating Column Values Into A Sequence Using Pandas

Answers (1)

Related Questions