ShanZhengYang
ShanZhengYang

Reputation: 17631

Pivot a pandas DataFrame to be the correct format: `DataError: No numeric types to aggregate`

Here is a pandas DataFrame I would like to manipulate:

import pandas as pd

data = {"grouping": ["item1", "item1", "item1", "item2", "item2", "item2", "item2", ...],
        "labels": ["A", "B", "C", "A", "B", "C", "D", ...],
        "count": [5, 1, 8, 3, 731, 189, 9, ...]}

df = pd.DataFrame(data)

print(df)
>>>   grouping            labels       count
0        item1             A            5
1        item1             B            1
2        item1             C            8
3        item2             A            3
4        item2             B          731
5        item2             C          189
6        item2             D            9
7        ...               ...         ....

I would like to "unfold" this dataframe into the following format:

grouping    A    B    C    D
item1       5    1    8    3
item2       3    731  189  9
....        ........

How would one do this? I would think that this would work:

pd.pivot_table(df,index=["grouping", "labels"]

but I get the following error:

DataError: No numeric types to aggregate

Upvotes: 3

Views: 561

Answers (4)

piRSquared
piRSquared

Reputation: 294258

There are four idiomatic pandas ways to do this.

  • No duplicates among grouping columns. Does not require aggregation
    • pivot
    • set_index
  • Duplicates among grouping columns. Does require aggregation
    • pivot_table
    • groupby

pivot

df.pivot('grouping', 'labels', 'count')

set_index

df.set_index(['grouping', 'labels'])['count'].unstack()

pivot_table

df.pivot_table('count', 'grouping', 'labels')

groupby

df.groupby(['grouping', 'labels'])['count'].sum().unstack()

All yield

labels      A      B      C    D
grouping                        
item1     5.0    1.0    8.0  NaN
item2     3.0  731.0  189.0  9.0

timing

enter image description here

With the groupby, set_index, or pivot_table approach, you can easily fill in missing values with fill_value=0

df.pivot_table('count', 'grouping', 'labels', fill_value=0)

df.groupby(['grouping', 'labels'])['count'].sum().unstack(fill_value=0)

df.set_index(['grouping', 'labels'])['count'].sum().unstack(fill_value=0)

All yield

labels    A    B    C  D
grouping                
item1     5    1    8  0
item2     3  731  189  9

Additional thoughts on groupby

Because we don't require any aggregation. If we wanted to use groupby, we can minimize the impact of the implicit aggregation by utilizing a less impactful aggregator.

df.groupby(['grouping', 'labels'])['count'].max().unstack()

or

df.groupby(['grouping', 'labels'])['count'].first().unstack()

timing groupby

enter image description here

Upvotes: 5

Scott Boston
Scott Boston

Reputation: 153460

Use set_index and unstack:

df = df.set_index(['grouping','labels']).unstack().rename_axis(None)
df.columns = df.columns.droplevel()
print(df)

Output:

labels  A    B    C     D
item1   5    1    8  None
item2   3  731  189     9

Upvotes: 6

greole
greole

Reputation: 4771

Try:

In [1]: import pandas as pd
   ...: 
   ...: data = {"grouping": ["item1", "item1", "item1", "item2", "item2", "item2", "item2"],
   ...:         "labels": ["A", "B", "C", "A", "B", "C", "D"],
   ...:         "count": [5, 1, 8, 3, 731, 189, 9]}
   ...: 
In [2]: df = pd.DataFrame(data)
In [3]: df.pivot_table(index="grouping",columns="labels")

Out[3]: 
             count              
    labels       A    B    C   D
    grouping                    
    item1        5    1    8 NaN
    item2        3  731  189   9

Upvotes: 2

BrenBarn
BrenBarn

Reputation: 251383

You put labels in the index, but you want it in the columns:

>>> df.pivot_table(index='grouping', columns='labels')
         count                   
labels       A      B      C    D
grouping                         
item1      5.0    1.0    8.0  NaN
item2      3.0  731.0  189.0  9.0

Note that this makes the columns a MultiIndex. If you don't want that, explicitly pass values: df.pivot_table(index='grouping', columns='labels', values='count').

Also, note that the kind of reshape you seem to be looking for will only be possible if each combination of grouping and label has exactly one or zero values. If any combination occurs more than once, you need to decide how to aggregate them (e.g., by summing the matching values).

Upvotes: 3

Related Questions