user4943236
user4943236

Reputation: 6344

Difference between groupby and pivot_table for pandas dataframes

I just started learning Pandas and was wondering if there is any difference between groupby and pivot_table functions. Can anyone help me understand the difference between them?

Upvotes: 112

Views: 86085

Answers (4)

cottontail
cottontail

Reputation: 23271

groupby and pivot_table are related as follows (in fact, under the hood pivot_table is defined using groupby)1:

  • pivot_table = groupby + unstack
  • groupby = pivot_table + stack

In particular, if columns parameter of pivot_table() is not used, then groupby() and pivot_table() both produce the same result (if the same aggregator function is used).

# sample
df = pd.DataFrame({"a": [1,1,1,2,2,2], "b": [1,1,2,2,3,3], "c": [0,0.5,1,1,2,2]})

# example
gb = df.groupby(['a','b'])[['c']].sum()
pt = df.pivot_table(index=['a','b'], values=['c'], aggfunc='sum')

# equality test
gb.equals(pt)   # True <--- no `columns=` kwarg, no `unstack()`

If you're familiar with Microsoft Excel, both pivot_table and groupby behave like the PivotTable functionality in Excel:

  • groupby: by kwarg (i.e. grouper column) corresponds to ROWS, the columns being aggregated correspond to VALUES and the groupby methods (mean(), sum() etc.) correspond to the functions you select in Value Field Settings.
  • pivot_table: values kwarg corresponds to VALUES, index corresponds to ROWS, columns corresponds to COLUMNS and aggfunc corresponds to the function you select in Value Field Settings.

In pandas, just like in Excel, groupby results in a long stacked table, while pivot_table results in a wide table. If you unstack the long table from groupby, you get the same result as pivot_table.


1 In general, if we check the source code, pivot_table() internally calls __internal_pivot_table(). This function creates a single flat list out of index and columns and calls groupby() with this list as the grouper. Then after aggregation, calls unstack() on the list of columns.

If columns are never passed, there is nothing to unstack on, so groupby and pivot_table trivially produce the same output.

A demonstration of this connection is:

gb = (
    df
    .groupby(['a','b'])[['c']].sum()
    .unstack(['b'])
)
pt = df.pivot_table(index=['a'], columns=['b'], values=['c'], aggfunc='sum')

gb.equals(pt)   # True <--- they produce the same output

As stack() is the inverse operation of unstack(), the following holds True as well:

pt = df.pivot_table(index=['a'], columns=['b'], values=['c'], aggfunc='sum').stack(['b'])
gb = df.groupby(['a','b'])[['c']].sum()
pt.equals(gb)   # True  <--- they produce the same output

In conclusion, depending on the use case, one is more convenient than the other but they can both be used instead of the other and after correctly applying stack()/unstack(), both will result in the same output.

However, there's a performance difference between the two methods. In short, pivot_table() is slower than groupby().agg().unstack(). You can read more about it from this answer.

Upvotes: 26

Mykola Zotko
Mykola Zotko

Reputation: 17854

The pivot_table is a kind of two-dimensional groupby where you group by two variables (columns) and one variable ends up in the index and another in the columns.

groupby

enter image description here

pivot_table

pivot_table

Upvotes: 10

kyramichel
kyramichel

Reputation: 491

It's more appropriate to use .pivot_table() instead of .groupby() when you need to show aggregates with both rows and column labels.

.pivot_table() makes it easy to create row and column labels at the same time and is preferable, even though you can get similar results using .groupby() with few extra steps.

Upvotes: 13

David Maust
David Maust

Reputation: 8270

Both pivot_table and groupby are used to aggregate your dataframe. The difference is only with regard to the shape of the result.

Using pd.pivot_table(df, index=["a"], columns=["b"], values=["c"], aggfunc=np.sum) a table is created where a is on the row axis, b is on the column axis, and the values are the sum of c.

Example:

df = pd.DataFrame({"a": [1,2,3,1,2,3], "b":[1,1,1,2,2,2], "c":np.random.rand(6)})
pd.pivot_table(df, index=["a"], columns=["b"], values=["c"], aggfunc=np.sum)

b         1         2
a                    
1  0.528470  0.484766
2  0.187277  0.144326
3  0.866832  0.650100

Using groupby, the dimensions given are placed into columns, and rows are created for each combination of those dimensions.

In this example, we create a series of the sum of values c, grouped by all unique combinations of a and b.

df.groupby(['a','b'])['c'].sum()

a  b
1  1    0.528470
   2    0.484766
2  1    0.187277
   2    0.144326
3  1    0.866832
   2    0.650100
Name: c, dtype: float64

A similar usage of groupby is if we omit the ['c']. In this case, it creates a dataframe (not a series) of the sums of all remaining columns grouped by unique values of a and b.

print df.groupby(["a","b"]).sum()
            c
a b          
1 1  0.528470
  2  0.484766
2 1  0.187277
  2  0.144326
3 1  0.866832
  2  0.650100

Upvotes: 141

Related Questions