hobscrk777
hobscrk777

Reputation: 2647

Reshaping Pandas DataFrame with Repeated Column Index

Suppose I have the following DataFrame:

>>> cols = ['model', 'parameter', 'condition', 'value']
>>> df = pd.DataFrame([['BMW', '0-60', 'rain', '7'], ['BMW', '0-60', 'sun', '7'],
                   ['BMW','mpg', 'rain','25'], 
                   ['BMW', 'stars', 'rain','5'],
                   ['Toyota', '0-60', 'rain','9'], 
                   ['Toyota','mpg', 'rain','40'], 
                   ['Toyota', 'stars', 'rain','4']], columns=cols)

>>> df
    model parameter condition value
0     BMW      0-60      rain     7
1     BMW      0-60       sun     7
2     BMW       mpg      rain    25
3     BMW     stars      rain     5
4  Toyota      0-60      rain     9
5  Toyota       mpg      rain    40
6  Toyota     stars      rain     4

This is a list of performance metrics for various cars at different conditions. This is a made up data set, of course, but its representative of my problem.

What I ultimately want is to have observation for a given condition on its own row, and each metric on its own column. This would look something like this:

    parameter  condition  0-60   mpg    stars
     model        
0     BMW       rain       7      25     5
1     BMW       sun        7      NaN    NaN
2     Toyota    rain       9      40     4

Note that I just made up the format above. I don't know if Pandas would generate something exactly like that, but that's the general idea. I would also of course transform the "condition" into a Boolean array and fill in the NaNs.

My problem is that when I try to use the pivot method I get an error. I think this is because my "column" key is repeated (because I have BMW 0-60 stats for the rain and for the sun conditions).

df.pivot(index='model',columns='parameter')
ValueError: Index contains duplicate entries, cannot reshape

Does anyone know of a slick way to do this? I'm finding a lot of these Pandas reshaping methods to be quite obtuse.

Upvotes: 0

Views: 1077

Answers (2)

Alex Riley
Alex Riley

Reputation: 176770

You can get the result you want using pivot_table and passing the following parameters:

>>> df.pivot_table(index=['model', 'condition'], values='value', columns='parameter')
parameter         0-60  mpg  stars
model  condition                  
BMW    rain          7   25      5
       sun           7  NaN    NaN
Toyota rain          9   40      4

(You may need to ensure the "value" column has numeric types first or else you can pass aggfunc=lambda x: x in the pivot_table function to get around this requirement.)

Upvotes: 0

Alex
Alex

Reputation: 19104

You can just change the index and unstack it...

df.set_index(['model', 'condition', 'parameter']).unstack()

returns

                 value           
parameter         0-60  mpg stars
model  condition                 
BMW    rain          7   25     5
       sun           7  NaN   NaN
Toyota rain          9   40     4

Upvotes: 1

Related Questions