Cartesian product of more than two DataFrames that contain duplicate indices

Question

Let's say I have the following DataFrame, where each row represents an event performed by a certain user on a certain time:

In [1]: df
Out[1]: 
      time event
user            
a        1     x
a        2     y
a        3     z
b        1     x
b        2     x
b        3     z
b        4     z
c        1     y
c        2     y
c        3     z
d        1     z

I would like to reshape this such that it has the following structure:

In [2]: dfm
Out[2]: 
       x   y  z
user           
a      1   2  3
b      1 NaN  3
b      1 NaN  4
b      2 NaN  3
b      2 NaN  4
c    NaN   1  3
c    NaN   2  3
d    NaN NaN  1

I currently obtain this by first creating one DataFrame per event:

In [3]: dfs = [d[['time']].rename(columns={'time': k}) for k, d in df.groupby('event')]

In [4]: dfs
Out[4]: 
[      x
 user   
 a     1
 b     1
 b     2,       y
 user   
 a     2
 c     1
 c     2,       z
 user   
 a     3
 b     3
 b     4
 c     3
 d     1]

And then calling pd.merge multiple times:

In [5]: dfm = dfs[0]

In [5]: for d in dfs[1:]:
   ...:     dfm = pd.merge(dfm, d, left_index=True, right_index=True, how='outer')

This works fine, but I'm wondering whether there is a better way. It would not be the first time that pandas has surprised me with some nifty function! I have tried pd.concat(dfs, axis=1), but that produces the following error (only last line shown):

ValueError: Shape of passed values is (1, 5), indices imply (1, 4)

I have also looked into pd.pivot_table, but that produces one row per user and averages the timestamps. Maybe I'm overlooking something. Any help is greatly appreciated!

Guillaume Jacquenot · Accepted Answer

Below is the solution discussed in the question

import pandas as pd
from StringIO import StringIO

data = \
'user,time,event
\
a,1,x
\
a,2,y
\
a,3,z
\
b,1,x
\
b,2,x
\
b,3,z
\
b,4,z
\
c,1,y
\
c,2,y
\
c,3,z
\
d,1,z
'

df = pd.read_csv(StringIO(data), index_col='user')
dfs = [d[['time']].rename(columns={'time': k}) for k, d in df.groupby('event')]
dfm = dfs[0]
for d in dfs[1:]:
    dfm = pd.merge(dfm, d, left_index=True, right_index=True, how='outer')

Cartesian product of more than two DataFrames that contain duplicate indices

Answers (1)

Related Questions