Ami Tavory
Ami Tavory

Reputation: 76346

Is There Complete Overlap Between `pd.pivot_table` and `pd.DataFrame.groupby` + `pd.DataFrame.unstack`?

(Please note that there's a question Pandas: group by and Pivot table difference, but this question is different.)

Suppose you start with a DataFrame

df = pd.DataFrame({'a': ['x'] * 2 + ['y'] * 2, 'b': [0, 1, 0, 1], 'val': range(4)})
>>> df
Out[18]: 
   a  b  val
0  x  0    0
1  x  1    1
2  y  0    2
3  y  1    3

Now suppose you want to make the index a, the columns b, the values in a cell val, and specify what to do if there are two or more values in a resulting cell:

b  0  1
a      
x  0  1
y  2  3

Then you can do this either through

df.val.groupby([df.a, df.b]).sum().unstack()

or through

pd.pivot_table(df, index='a', columns='b', values='val', aggfunc='sum')

So it seems to me that there's a simple correspondence between correspondence between the two (given one, you could almost write a script to transform it into the other). I also thought of more complex cases with hierarchical indices / columns, but I still see no difference.

Is there something I've missed?

Upvotes: 6

Views: 777

Answers (1)

MaxU - stand with Ukraine
MaxU - stand with Ukraine

Reputation: 210882

If I understood the source code for pivot_table(index, columns, values, aggfunc) correctly it's tuned up equivalent for:

df.groupby([index + columns]).agg(aggfunc).unstack(columns)

plus:

  • margins (subtotals and grand totals as @ayhan has already said)
  • pivot_table() also removes extra multi-levels from columns axis (see example below)
  • convenient dropna parameter: Do not include columns whose entries are all NaN

Demo: (I took this DF from the docstring [source code for pivot_table()])

In [40]: df
Out[40]:
     A    B      C  D
0  foo  one  small  1
1  foo  one  large  2
2  foo  one  large  2
3  foo  two  small  3
4  foo  two  small  3
5  bar  one  large  4
6  bar  one  small  5
7  bar  two  small  6
8  bar  two  large  7

In [41]: df.pivot_table(index=['A','B'], columns='C', values='D', aggfunc=[np.sum,np.mean])
Out[41]:
          sum        mean
C       large small large small
A   B
bar one   4.0   5.0   4.0   5.0
    two   7.0   6.0   7.0   6.0
foo one   4.0   1.0   2.0   1.0
    two   NaN   6.0   NaN   3.0

pay attention at the top level column: D

In [42]: df.groupby(['A','B','C']).agg([np.sum, np.mean]).unstack('C')
Out[42]:
            D
          sum        mean
C       large small large small
A   B
bar one   4.0   5.0   4.0   5.0
    two   7.0   6.0   7.0   6.0
foo one   4.0   1.0   2.0   1.0
    two   NaN   6.0   NaN   3.0

why not deprecate pivot_tale? groupby seems much more general.

IMO, because it's very easy to use and very convenient! ;)

Upvotes: 6

Related Questions