Reputation: 6344
I just started learning Pandas and was wondering if there is any difference between groupby
and pivot_table
functions. Can anyone help me understand the difference between them?
Upvotes: 112
Views: 86085
Reputation: 23271
groupby
and pivot_table
are related as follows (in fact, under the hood pivot_table
is defined using groupby
)1:
In particular, if columns
parameter of pivot_table()
is not used, then groupby()
and pivot_table()
both produce the same result (if the same aggregator function is used).
# sample
df = pd.DataFrame({"a": [1,1,1,2,2,2], "b": [1,1,2,2,3,3], "c": [0,0.5,1,1,2,2]})
# example
gb = df.groupby(['a','b'])[['c']].sum()
pt = df.pivot_table(index=['a','b'], values=['c'], aggfunc='sum')
# equality test
gb.equals(pt) # True <--- no `columns=` kwarg, no `unstack()`
If you're familiar with Microsoft Excel, both pivot_table
and groupby
behave like the PivotTable functionality in Excel:
groupby
: by
kwarg (i.e. grouper column) corresponds to ROWS, the columns being aggregated correspond to VALUES and the groupby methods (mean()
, sum()
etc.) correspond to the functions you select in Value Field Settings.pivot_table
: values
kwarg corresponds to VALUES, index
corresponds to ROWS, columns
corresponds to COLUMNS and aggfunc
corresponds to the function you select in Value Field Settings.In pandas, just like in Excel, groupby
results in a long stacked table, while pivot_table
results in a wide table. If you unstack the long table from groupby
, you get the same result as pivot_table
.
1 In general, if we check the source code, pivot_table()
internally calls __internal_pivot_table()
. This function creates a single flat list out of index and columns and calls groupby()
with this list as the grouper. Then after aggregation, calls unstack()
on the list of columns.
If columns are never passed, there is nothing to unstack on, so groupby
and pivot_table
trivially produce the same output.
A demonstration of this connection is:
gb = (
df
.groupby(['a','b'])[['c']].sum()
.unstack(['b'])
)
pt = df.pivot_table(index=['a'], columns=['b'], values=['c'], aggfunc='sum')
gb.equals(pt) # True <--- they produce the same output
As stack()
is the inverse operation of unstack()
, the following holds True as well:
pt = df.pivot_table(index=['a'], columns=['b'], values=['c'], aggfunc='sum').stack(['b'])
gb = df.groupby(['a','b'])[['c']].sum()
pt.equals(gb) # True <--- they produce the same output
In conclusion, depending on the use case, one is more convenient than the other but they can both be used instead of the other and after correctly applying stack()
/unstack()
, both will result in the same output.
However, there's a performance difference between the two methods. In short, pivot_table()
is slower than groupby().agg().unstack()
. You can read more about it from this answer.
Upvotes: 26
Reputation: 17854
The pivot_table
is a kind of two-dimensional groupby
where you group by two variables (columns) and one variable ends up in the index and another in the columns.
Upvotes: 10
Reputation: 491
It's more appropriate to use .pivot_table()
instead of .groupby()
when you need to show aggregates with both rows and column labels.
.pivot_table()
makes it easy to create row and column labels at the same time and is preferable, even though you can get similar results using .groupby()
with few extra steps.
Upvotes: 13
Reputation: 8270
Both pivot_table
and groupby
are used to aggregate your dataframe. The difference is only with regard to the shape of the result.
Using pd.pivot_table(df, index=["a"], columns=["b"], values=["c"], aggfunc=np.sum)
a table is created where a
is on the row axis, b
is on the column axis, and the values are the sum of c
.
Example:
df = pd.DataFrame({"a": [1,2,3,1,2,3], "b":[1,1,1,2,2,2], "c":np.random.rand(6)})
pd.pivot_table(df, index=["a"], columns=["b"], values=["c"], aggfunc=np.sum)
b 1 2
a
1 0.528470 0.484766
2 0.187277 0.144326
3 0.866832 0.650100
Using groupby
, the dimensions given are placed into columns, and rows are created for each combination of those dimensions.
In this example, we create a series of the sum of values c
, grouped by all unique combinations of a
and b
.
df.groupby(['a','b'])['c'].sum()
a b
1 1 0.528470
2 0.484766
2 1 0.187277
2 0.144326
3 1 0.866832
2 0.650100
Name: c, dtype: float64
A similar usage of groupby
is if we omit the ['c']
. In this case, it creates a dataframe (not a series) of the sums of all remaining columns grouped by unique values of a
and b
.
print df.groupby(["a","b"]).sum()
c
a b
1 1 0.528470
2 0.484766
2 1 0.187277
2 0.144326
3 1 0.866832
2 0.650100
Upvotes: 141