Reputation: 76346
(Please note that there's a question Pandas: group by and Pivot table difference, but this question is different.)
Suppose you start with a DataFrame
df = pd.DataFrame({'a': ['x'] * 2 + ['y'] * 2, 'b': [0, 1, 0, 1], 'val': range(4)})
>>> df
Out[18]:
a b val
0 x 0 0
1 x 1 1
2 y 0 2
3 y 1 3
Now suppose you want to make the index a
, the columns b
, the values in a cell val
, and specify what to do if there are two or more values in a resulting cell:
b 0 1
a
x 0 1
y 2 3
Then you can do this either through
df.val.groupby([df.a, df.b]).sum().unstack()
or through
pd.pivot_table(df, index='a', columns='b', values='val', aggfunc='sum')
So it seems to me that there's a simple correspondence between correspondence between the two (given one, you could almost write a script to transform it into the other). I also thought of more complex cases with hierarchical indices / columns, but I still see no difference.
Is there something I've missed?
Are there operations that can be performed using one and not the other?
Are there, perhaps, operations easier to perform using one over the other?
If not, why not deprecate pivot_tale
? groupby
seems much more general.
Upvotes: 6
Views: 777
Reputation: 210882
If I understood the source code for pivot_table(index, columns, values, aggfunc)
correctly it's tuned up equivalent for:
df.groupby([index + columns]).agg(aggfunc).unstack(columns)
plus:
pivot_table()
also removes extra multi-levels from columns axis (see example below)dropna
parameter: Do not include columns whose entries are all NaNDemo: (I took this DF from the docstring [source code for pivot_table()
])
In [40]: df
Out[40]:
A B C D
0 foo one small 1
1 foo one large 2
2 foo one large 2
3 foo two small 3
4 foo two small 3
5 bar one large 4
6 bar one small 5
7 bar two small 6
8 bar two large 7
In [41]: df.pivot_table(index=['A','B'], columns='C', values='D', aggfunc=[np.sum,np.mean])
Out[41]:
sum mean
C large small large small
A B
bar one 4.0 5.0 4.0 5.0
two 7.0 6.0 7.0 6.0
foo one 4.0 1.0 2.0 1.0
two NaN 6.0 NaN 3.0
pay attention at the top level column: D
In [42]: df.groupby(['A','B','C']).agg([np.sum, np.mean]).unstack('C')
Out[42]:
D
sum mean
C large small large small
A B
bar one 4.0 5.0 4.0 5.0
two 7.0 6.0 7.0 6.0
foo one 4.0 1.0 2.0 1.0
two NaN 6.0 NaN 3.0
why not deprecate pivot_tale? groupby seems much more general.
IMO, because it's very easy to use and very convenient! ;)
Upvotes: 6