Pandas DataFrame formatting to get expected output

Question

In the following Pandas DataFrame,

df = pd.DataFrame({'session' : ["1","1","2","2","3","3"],
                'path'  : ["p1","p2","p1","p2","p2","p3"], 'seconds' : ["20","21","132","10","24","45"]})

I need to get an output like the following. (Pages as columns, sessions as rows and seconds in each cell.)

session,p1,p2,p3
1,20,21,0
2,132,10,0
3,0,24,45

What I have done so far.

In [76]: wordlist = ['p1', 'p2', 'p3']
In [77]: df2 = pd.DataFrame(df.groupby('session').apply(lambda x: ','.join(x.path)))
In [78]: df2 #I have renamed the columns
Out[78]: 

                  path
        session       
        1        p1,p2
        2        p1,p2
        3        p2,p3

In [79]: df3 = pd.DataFrame(df.groupby('session').apply(lambda x: ','.join(x.seconds.astype(str))))
In [80]: df3 #I have renamed the columns
Out[80]: 
                   path
        session        
        1         20,21
        2        132,10
        3         24,45

The following just gives the boolean result. I need to get my expected output. Any help on this?

In [84]: pd.DataFrame({name : df2["path"].str.contains(name) for name in wordlist})
Out[84]: 
            p1    p2     p3
session                    
1         True  True  False
2         True  True  False
3        False  True   True

juniper- · Accepted Answer

Use a pivot table:

df.pivot(index='session', columns='path')

Then replace all the Nan's with zeros:

df2 = df1.fillna(0)

This gives you the following output:

        seconds        
path         p1  p2  p3
session                
1            20  21   0
2           132  10   0
3             0  24  45

Then you might want to drop the multiindex column:

df1.columns = df1.columns.droplevel(0)

Yielding your desired solution (sans commas):

path      p1  p2  p3
session             
1         20  21   0
2        132  10   0
3          0  24  45

Finally you can convert it to comma-separated string using StringIO:

import StringIO
s = StringIO.StringIO()
df1.to_csv(s)
print s.getvalue()

With the following output:

session,p1,p2,p3
1,20,21,0
2,132,10,0
3,0,24,45

Pandas DataFrame formatting to get expected output

Answers (1)

Related Questions