ytu

Reputation: 1850

ValueError when trying to have multi-index in DataFrame.pivot

I have read "pandas: how to run a pivot with a multi-index?" but it did not solve my problem.

Given the data frame below:

import pandas as pd
df = pd.DataFrame({
    "date": ["20180920"] * 6,
    "id": ["A123456789"] * 6,
    "test": ["a", "b", "c", "d", "e", "f"],
    "result": [70, 90, 110, "(-)", "(+)", 0.3],
    "ref": ["< 90", "70 - 100", "100 - 120", "(-)", "(-)", "< 1"]
})

I'd like to spread the test column, use the values in result, and ignore ref. In other words, the desired output looks like:

       date          id      a   b    c    d    e    f
0  20180920  A123456789     70  90  110  (-)  (+)  0.3

So I tried df.pivot(index=["date", "id"], columns="test", values="result"), but it failed with ValueError: Length of passed values is 6, index implies 2. I think it is related to "If an array is passed, it must be the same length as the data." in the pivot_table documentation, but I don't understand what that means. Can someone elaborate on that, please?

BTW, I finally got my desired output with df.drop(columns="ref").set_index(["date", "id", "test"]).unstack(level=2). Is that the only correct way?

Upvotes: 15

Views: 17680

Answers (3)

Paul Rougieux

Reputation: 11409

Using a function defined in pandas/issues/23955:

def multiindex_pivot(df, columns=None, values=None):
    # https://github.com/pandas-dev/pandas/issues/23955
    names = list(df.index.names)
    df = df.reset_index()
    list_index = df[names].values
    tuples_index = [tuple(i) for i in list_index]  # hashable
    df = df.assign(tuples_index=tuples_index)
    df = df.pivot(index="tuples_index", columns=columns, values=values)
    tuples_index = df.index  # reduced
    index = pd.MultiIndex.from_tuples(tuples_index, names=names)
    df.index = index
    return df

multiindex_pivot(df.set_index(['date', 'id']), columns='test', values='result')
Out[10]:
test                  a   b    c    d    e    f
date     id
20180920 A123456789  70  90  110  (-)  (+)  0.3

Upvotes: 3

jezrael

Reputation: 863226

Using pivot is possible, but the code is a bit crazy:

df = (df.set_index(["date", "id"])
        .pivot(columns="test")['result']
        .reset_index()
        .rename_axis(None, axis=1)
     )
print(df)

       date          id   a   b    c    d    e    f
0  20180920  A123456789  70  90  110  (-)  (+)  0.3

About the docs, you can check issue 16578; pandas 0.24.0 should bring improved docs, or maybe new support for working with a MultiIndex. Issue 8160 leaves this a bit unclear as well.

In my opinion, your last code only needs a small improvement (the same solution as @Vaishali's): create a Series with a MultiIndex by selecting the column after set_index, and drop the level argument from unstack, because the last level of the MultiIndex is unstacked by default (see Series.unstack):
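As a side note on version support: to my knowledge, pandas 1.1.0 and later accept a list for index in DataFrame.pivot, so the call from the question works there without set_index. A minimal sketch, assuming pandas >= 1.1 (the name wide is only illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["20180920"] * 6,
    "id": ["A123456789"] * 6,
    "test": ["a", "b", "c", "d", "e", "f"],
    "result": [70, 90, 110, "(-)", "(+)", 0.3],
    "ref": ["< 90", "70 - 100", "100 - 120", "(-)", "(-)", "< 1"],
})

# Assumption: pandas >= 1.1. Older versions raise the ValueError from the question.
wide = (df.pivot(index=["date", "id"], columns="test", values="result")
          .reset_index()
          .rename_axis(None, axis=1))  # drop the leftover "test" columns name
print(wide)
```

On older versions, the set_index-based variants in this thread remain the way to go.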

level : int, string, or list of these, default last level

Level(s) to unstack, can pass level name

# all three return the same output
df.set_index(["date", "id", "test"])['result'].unstack()
df.set_index(["date", "id", "test"])['result'].unstack(level=2)
df.set_index(["date", "id", "test"])['result'].unstack(level=-1)
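The equivalence of these calls can be checked directly. This is a small sketch on the question's data; the variable names (s, default, by_pos, by_neg, by_name) are made up for illustration, and the by-name variant follows the "can pass level name" note quoted above:

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["20180920"] * 6,
    "id": ["A123456789"] * 6,
    "test": ["a", "b", "c", "d", "e", "f"],
    "result": [70, 90, 110, "(-)", "(+)", 0.3],
    "ref": ["< 90", "70 - 100", "100 - 120", "(-)", "(-)", "< 1"],
})

# Selecting "result" after set_index yields a Series with a three-level MultiIndex.
s = df.set_index(["date", "id", "test"])["result"]

default = s.unstack()              # no level given: the last level is unstacked
by_pos = s.unstack(level=2)        # same level, addressed by position
by_neg = s.unstack(level=-1)       # same level, counted from the end
by_name = s.unstack(level="test")  # same level, addressed by name

same = all(default.equals(other) for other in (by_pos, by_neg, by_name))
print(same)
```

Addressing the level by name tends to be the most robust choice if index levels are later added or reordered.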

Upvotes: 11

Vaishali

Reputation: 38415

pivot does not accept a list of columns as index, so you need to use pivot_table. The aggregation with first assumes that there are no duplicates.

pd.pivot_table(df, index=["date", "id"], columns="test", values="result", aggfunc='first')\
.reset_index().rename_axis(None, axis=1)

It would be safer to use set_index, unstack, and rename_axis, as @piRsquared suggested:

df.set_index(['date', 'id', 'test']).result.unstack()\
.reset_index().rename_axis(None, axis=1)

Either way you get:

       date          id   a   b    c    d    e    f
0  20180920  A123456789  70  90  110  (-)  (+)  0.3
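One way to see why the unstack route is "safer" is to add a duplicate row. The sketch below appends a hypothetical second "a" result (75) for the same date and id; that row is an assumption made for illustration and is not part of the original data:

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["20180920"] * 6,
    "id": ["A123456789"] * 6,
    "test": ["a", "b", "c", "d", "e", "f"],
    "result": [70, 90, 110, "(-)", "(+)", 0.3],
    "ref": ["< 90", "70 - 100", "100 - 120", "(-)", "(-)", "< 1"],
})

# Hypothetical duplicate: a second "a" result for the same date and id.
dup = pd.concat([df, df.iloc[[0]].assign(result=75)], ignore_index=True)

# pivot_table with aggfunc="first" keeps the first value and silently drops the 75.
wide = (pd.pivot_table(dup, index=["date", "id"], columns="test",
                       values="result", aggfunc="first")
          .reset_index()
          .rename_axis(None, axis=1))

# unstack refuses to reshape an index that contains duplicate entries.
try:
    dup.set_index(["date", "id", "test"])["result"].unstack()
    unstack_error = None
except ValueError as exc:
    unstack_error = exc

print(wide)
print(unstack_error)
```

Whether a silent pick or a loud failure is preferable depends on the data; for lab results like these, failing on duplicates is arguably the more defensive default.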

Upvotes: 21
