Reputation: 153
Let's say I have the following DataFrame, where each row represents an event performed by a certain user on a certain time:
In [1]: df
Out[1]:
time event
user
a 1 x
a 2 y
a 3 z
b 1 x
b 2 x
b 3 z
b 4 z
c 1 y
c 2 y
c 3 z
d 1 z
I would like to reshape this such that it has the following structure:
In [2]: dfm
Out[2]:
x y z
user
a 1 2 3
b 1 NaN 3
b 1 NaN 4
b 2 NaN 3
b 2 NaN 4
c NaN 1 3
c NaN 2 3
d NaN NaN 1
I currently obtain this by first creating one DataFrame per event:
In [3]: dfs = [d[['time']].rename(columns={'time': k}) for k, d in df.groupby('event')]
In [4]: dfs
Out[4]:
[ x
user
a 1
b 1
b 2, y
user
a 2
c 1
c 2, z
user
a 3
b 3
b 4
c 3
d 1]
And then calling pd.merge
multiple times:
In [5]: dfm = dfs[0]
In [5]: for d in dfs[1:]:
...: dfm = pd.merge(dfm, d, left_index=True, right_index=True, how='outer')
This works fine, but I'm wondering whether there is a better way. It would not be the first time that pandas has surprised me with some nifty function! I have tried pd.concat(dfs, axis=1)
, but that produces the following error (only last line shown):
ValueError: Shape of passed values is (1, 5), indices imply (1, 4)
I have also looked into pd.pivot_table
, but that produces one row per user and averages the timestamps. Maybe I'm overlooking something. Any help is greatly appreciated!
Upvotes: 3
Views: 259
Reputation: 11717
Below is the solution discussed in the question
import pandas as pd
from StringIO import StringIO
data = \
'user,time,event\n\
a,1,x\n\
a,2,y\n\
a,3,z\n\
b,1,x\n\
b,2,x\n\
b,3,z\n\
b,4,z\n\
c,1,y\n\
c,2,y\n\
c,3,z\n\
d,1,z\n'
df = pd.read_csv(StringIO(data), index_col='user')
dfs = [d[['time']].rename(columns={'time': k}) for k, d in df.groupby('event')]
dfm = dfs[0]
for d in dfs[1:]:
dfm = pd.merge(dfm, d, left_index=True, right_index=True, how='outer')
Upvotes: 1