ZAR
ZAR

Reputation: 2736

Pandas: groupby unstack, duplicate index error

I'm having trouble pivoting/reshaping my dataframe. After a melt() operation, I have the following df:

|----|------|------|------------|
| ID | Rank | Var  | Val        |
|----|------|------|------------|
|  1 |   1  | date | 2020-01-01 |
|----|------|------|------------|
|  1 |   2  | date | 2020-02-01 |
|----|------|------|------------|
|  2 |   1  | date | 2020-01-01 |
|----|------|------|------------|
|  2 |   2  | date | 2020-02-01 |
|----|------|------|------------|
|  1 |   1  | amt  | 320        |
|----|------|------|------------|
|  1 |   2  | amt  | 480        |
|----|------|------|------------|
|  2 |   1  | amt  | 620        |
|----|------|------|------------|
|  2 |   2  | amt  | 400        |
|----|------|------|------------|

The result I am looking for would be to convert the rank values into columns:

|----|------|------------|------------|
| ID | Var  | 1          | 2          |
|----|------|------------|------------|
| 1  | date | 2020-01-01 | 2020-02-01 |
|    |------|------------|------------|
|    | amt  | 320        | 480        |
|----|------|------------|------------|
| 2  | date | 2020-01-01 | 2020-02-01 |
|    |------|------------|------------|
|    | amt  | 620        | 400        |
|----|------|------------|------------|

Immediately trying to unstack(level='Rank') results in a

ValueError: Index contains duplicate entries, cannot reshape

Ok, we do have duplicate IDs, let's groupby to consolidate:

df.set_index(['ID', 'Rank']).groupby(['ID', 'Rank']).apply(lambda x: x)

The apply() is just so we can return a dataframe and preview the results, which are:

|----|------|------|------------|
| ID | Rank | Var  | Val        |
|----|------|------|------------|
|  1 |   1  | date | 2020-01-01 |
|    |------|------|------------|
|    |   1  | amt  | 320        |
|    |------|------|------------|
|    |   2  | date | 2020-02-01 |
|    |------|------|------------|
|    |   2  | amt  | 480        |
|----|------|------|------------|
|  2 |   1  | date | 2020-01-01 |
|    |------|------|------------|
|    |   2  | date | 2020-02-01 |
|    |------|------|------------|
|    |   1  | amt  | 620        |
|    |------|------|------------|
|    |   2  | amt  | 400        |
|----|------|------|------------|

This is closer. Now we just need to convert Rank into cols:

df.set_index(['ID', 'Rank']).groupby(['ID', 'Rank']).apply(lambda x: x).unstack(level='Rank')

ValueError: Index contains duplicate entries, cannot reshape

????

I also tried using pivot_table, but not all of the columns are numeric (the above example is simplified, but you can see date in here, which throws off pivot_table).

I seem to be just one method away, but I've tried unstacking, reindexing, pivoting, explode()-ing, I just can't figure out this last part...

How can I get the desired dataframe?

Thank you!

Upvotes: 0

Views: 2034

Answers (1)

Aviad Rozenhek
Aviad Rozenhek

Reputation: 2409

your issue is that the index keys must be unique to unstack.

first, let's recreate this toy example

import pandas as pd
from IPython.display import display

df= pd.DataFrame(columns=['ID', 'Rank','Var', 'Val'],  data=[
    [1,1,'date', '2020-01-01'],
    [1,2,'date','2020-02-01'],
    [2,1,'date','2020-02-01'],
    [2,2,'date','2020-02-01'],
    [1,1,'amt',320],
    [1,2,'amt',480],
    [2,1,'amt',620],
    [2,2,'amt',400],
])

df = df.set_index(['ID', 'Rank'])
display(df)

now, lets solve

lets put unique keys in the index, namely the ID and Var fields. now unstacking is trivial

df.set_index('Var', append=True).unstack('Rank')

Upvotes: 1

Related Questions