How to perform calculations across specific rows and columns of a crosstabulation in pandas?

Question

import pandas as pd
import numpy as np

c1 = np.repeat(['a','b'], [50, 50], axis=0)
c2 = list('xy'*50)
c3 = np.repeat(['G1','G2'], [50, 50], axis=0)
np.random.shuffle(c3)
c4=np.repeat([1,2], [50,50],axis=0)
np.random.shuffle(c4)
val = np.random.rand(100)

df = pd.DataFrame({'c1':c1, 'c2':c2, 'c3':c3, 'c4':c4, 'val':val})

table = pd.crosstab([df.c1,df.c2],[df.c3,df.c4])
c3     G1      G2    
c4      1   2   1   2
c1 c2                
a  x    3  11   5   6
   y    9   5   7   4
b  x    5   7  11   2
   y    5   5   5  10

For each group (G1, G2), is it possible to compute ax - bx and ay - by only for c4 == 2, and get the result as a DataFrame like the following?

x G1  4
y G1  0
x G2  4
y G2 -6

EDIT: And how could I do this if the DataFrame were in this format?

c1 = np.repeat(['a','b'], [8, 8], axis=0)
c2 = list('xxxxyyyyxxxxyyyy')
c3 = ['G1','G1','G2','G2','G1','G1','G2','G2','G1','G1','G2','G2','G1','G1','G2','G2']
c4 = [1,2]*8
val = np.random.rand(16)
df = pd.DataFrame({'c1':c1,'c2':c2,'c3':c3,'c4':c4,'val':val})

Phillip Cloud · Accepted Answer

You can do this:

In [6]: table
Out[6]:
c3     G1      G2
c4      1   2   1  2
c1 c2
a  x    6   5   8  6
   y    9   4   5  7
b  x    5  10   4  6
   y    7   4   6  8

In [7]: g = table.xs(2, level='c4', axis=1)

In [8]: g
Out[8]:
c3     G1  G2
c1 c2
a  x    5   6
   y    4   7
b  x   10   6
   y    4   8

In [9]: g.groupby(level='c2').apply(lambda x: x.iloc[0] - x.iloc[1])
Out[9]:
c3  G1  G2
c2
x   -5   0
y    0  -1

Alternatively, pass as_index=False to groupby and use loc in the lambda, which is a bit more meaningful IMHO since you're indexing by name rather than integer location:

In [11]: g.groupby(level='c2', as_index=False).apply(lambda x: x.loc['a'] - x.loc['b'])
Out[11]:
c3  G1  G2
c2
x   -5   0
y    0  -1

Combining as_index=False with apply only works on pandas git master (as of this writing). If you're not using master, the plain groupby call returns a result with a duplicated c2 index level:

In [12]: r = g.groupby(level='c2').apply(lambda x: x.loc['a'] - x.loc['b'])

In [13]: r
Out[13]:
c3     G1  G2
c2 c2
x  x   -5   0
y  y    0  -1

You can remove the duplicate index level by reassigning the index attribute of r:

In [28]: r.index = r.index.droplevel(0)

In [29]: r
Out[29]:
c3  G1  G2
c2
x   -5   0
y    0  -1
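
As a side note, the same differences can probably be had without groupby at all, by selecting the 'a' and 'b' blocks by label and subtracting them. The following is only a minimal sketch: it rebuilds the table printed above by hand (the original came from random data), and the names diff and long_form are made up for illustration.

```python
import pandas as pd

# Rebuild the crosstab shown above by hand (values copied from Out[6]).
index = pd.MultiIndex.from_product([['a', 'b'], ['x', 'y']], names=['c1', 'c2'])
columns = pd.MultiIndex.from_product([['G1', 'G2'], [1, 2]], names=['c3', 'c4'])
table = pd.DataFrame(
    [[6, 5, 8, 6],
     [9, 4, 5, 7],
     [5, 10, 4, 6],
     [7, 4, 6, 8]],
    index=index, columns=columns,
)

# Keep only the c4 == 2 columns; the c4 level is dropped from the result.
g = table.xs(2, level='c4', axis=1)

# Partial indexing on the first row level yields two frames indexed by c2,
# so the subtraction aligns x with x and y with y.
diff = g.loc['a'] - g.loc['b']
print(diff)

# Optional: reshape into one value per (c3, c2) pair, as in the question.
long_form = diff.unstack()
print(long_form)
```

Because the rows are selected by label, this does not depend on 'a' appearing before 'b' in the table.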

EDIT: If instead you have a "molten" (long-format) DataFrame, do this:

In [28]: df
Out[28]:
   c1 c2  c3  c4    val
0   a  x  G1   1  0.244
1   a  x  G1   2  0.572
2   a  x  G2   1  0.837
3   a  x  G2   2  0.893
4   a  y  G1   1  0.951
5   a  y  G1   2  0.400
6   a  y  G2   1  0.391
7   a  y  G2   2  0.237
8   b  x  G1   1  0.904
9   b  x  G1   2  0.811
10  b  x  G2   1  0.536
11  b  x  G2   2  0.736
12  b  y  G1   1  0.546
13  b  y  G1   2  0.159
14  b  y  G2   1  0.735
15  b  y  G2   2  0.772

In [29]: g2 = df[df.c4 == 2]

In [30]: g2
Out[30]:
   c1 c2  c3  c4    val
1   a  x  G1   2  0.572
3   a  x  G2   2  0.893
5   a  y  G1   2  0.400
7   a  y  G2   2  0.237
9   b  x  G1   2  0.811
11  b  x  G2   2  0.736
13  b  y  G1   2  0.159
15  b  y  G2   2  0.772

In [31]: gb = g2.groupby(['c2', 'c3'])

In [32]: sub = gb.apply(lambda x: x.val.iloc[0] - x.val.iloc[1])

In [33]: sub
Out[33]:
c2  c3
x   G1   -0.239
    G2    0.157
y   G1    0.241
    G2   -0.535
dtype: float64

In [34]: sub.unstack()
Out[34]:
c3     G1     G2
c2
x  -0.239  0.157
y   0.241 -0.535
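
The iloc-based lambda above relies on the 'a' row coming before the 'b' row inside each group. A label-based variant is sketched below, assuming there is exactly one row per (c1, c2, c3) combination once c4 == 2 is selected; the val numbers are copied from the frame printed above, and wide is just an illustrative name.

```python
import pandas as pd

# Long-format frame with the values shown in Out[28].
df = pd.DataFrame({
    'c1': ['a'] * 8 + ['b'] * 8,
    'c2': list('xxxxyyyyxxxxyyyy'),
    'c3': ['G1', 'G1', 'G2', 'G2'] * 4,
    'c4': [1, 2] * 8,
    'val': [0.244, 0.572, 0.837, 0.893, 0.951, 0.400, 0.391, 0.237,
            0.904, 0.811, 0.536, 0.736, 0.546, 0.159, 0.735, 0.772],
})

g2 = df[df.c4 == 2]

# Move c1 into the columns so 'a' and 'b' sit side by side per (c2, c3).
# This step fails if any (c2, c3, c1) combination is duplicated.
wide = g2.set_index(['c2', 'c3', 'c1'])['val'].unstack('c1')

# Subtract by column label, then spread c3 across the columns.
sub = (wide['a'] - wide['b']).unstack('c3')
print(sub)
```

This should give the same c2-by-c3 layout as sub.unstack() above, up to floating-point rounding.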

Whenever I'm unsure about how the groups look in a groupby operation, I'll iterate over the groupby and print out its constituents:

In [40]: for _, x in g2.groupby(['c2', 'c3']):
   ....:     print(x)
   ....:     print()
   ....:
  c1 c2  c3  c4    val
1  a  x  G1   2  0.572
9  b  x  G1   2  0.811

   c1 c2  c3  c4    val
3   a  x  G2   2  0.893
11  b  x  G2   2  0.736

   c1 c2  c3  c4    val
5   a  y  G1   2  0.400
13  b  y  G1   2  0.159

   c1 c2  c3  c4    val
7   a  y  G2   2  0.237
15  b  y  G2   2  0.772

These are the xs in lambda x: ... that is passed to groupby.apply().
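
If looping over every group is more than you need, a single group can also be pulled out by its key. This is a small sketch that rebuilds g2 from the rows printed above; group_sizes and x_g1 are illustrative names only.

```python
import pandas as pd

# The c4 == 2 rows shown in Out[30], with their original row labels.
g2 = pd.DataFrame({
    'c1': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
    'c2': ['x', 'x', 'y', 'y', 'x', 'x', 'y', 'y'],
    'c3': ['G1', 'G2', 'G1', 'G2', 'G1', 'G2', 'G1', 'G2'],
    'c4': [2] * 8,
    'val': [0.572, 0.893, 0.400, 0.237, 0.811, 0.736, 0.159, 0.772],
}, index=[1, 3, 5, 7, 9, 11, 13, 15])

gb = g2.groupby(['c2', 'c3'])

# Number of rows in each (c2, c3) group.
group_sizes = gb.size()
print(group_sizes)

# Fetch one group by its key tuple; this is the frame the lambda
# receives as x for the ('x', 'G1') group.
x_g1 = gb.get_group(('x', 'G1'))
print(x_g1)
```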
