endangeredoxen
endangeredoxen

Reputation: 105

pandas DataFrame reshape by multiple column values

I'm trying to free myself of JMP for data analysis but cannot determine the pandas equivalent of JMP's Split Columns function. I'm starting with the following DataFrame:

In [1]: df = pd.DataFrame({'Level0': [0,0,0,0,0,0,1,1,1,1,1,1], 'Level1': [0,1,0,1,0,1,0,1,0,1,0,1], 'Vals': [1,3,2,4,1,6,7,5,3,3,2,8]})
In [2]: df
Out[2]:
    Level0  Level1  Vals
0        0       0     1
1        0       1     3
2        0       0     2
3        0       1     4
4        0       0     1
5        0       1     6
6        1       0     7
7        1       1     5
8        1       0     3
9        1       1     3
10       1       0     2
11       1       1     8

I can handle some of the output scenarios of JMP's function using the pivot_table function, but I'm stumped on the case where the Vals column is split by unique combinations of Level0 and Level1 to give the following output:

Level0   0       1
Level1   0   1   0   1
0        1   3   7   5
1        2   4   3   3
2        1   6   2   8

I tried pd.pivot_table(df, values='Vals', columns=['Level0', 'Level1']) but this gives mean values for the different combinations:

Level0  Level1
0       0         1.333333
        1         4.333333
1       0         4.000000
        1         5.333333

I also tried pd.pivot_table(df, values='Vals', index=df.index, columns=['Level0', 'Level1'] which gets me the column headers I want but doesn't work because it forces the output to have the same number of rows as the original so the output has a lot of NaN values:

Level0   0       1
Level1   0   1   0   1
0        1 NaN NaN NaN
1      NaN   3 NaN NaN
2        2 NaN NaN NaN
3      NaN   4 NaN NaN
4        1 NaN NaN NaN
5      NaN   6 NaN NaN
6      NaN NaN   7 NaN
7      NaN NaN NaN   5
8      NaN NaN   3 NaN
9      NaN NaN NaN   3
10     NaN NaN   2 NaN
11     NaN NaN NaN   8

Any suggestions?

Upvotes: 4

Views: 984

Answers (1)

user2285236
user2285236

Reputation:

It's a bit of workaround, but you can do:

df.pivot_table(index=df.groupby(['Level0', 'Level1']).cumcount(), 
               columns=['Level0', 'Level1'], values='Vals', aggfunc='first')
Out: 
Level0  0     1   
Level1  0  1  0  1
0       1  3  7  5
1       2  4  3  3
2       1  6  2  8

The idea here is that the index of the output is not readily available in the original DataFrame. You can get it with the following:

df.groupby(['Level0', 'Level1']).cumcount()
Out: 
0     0
1     0
2     1
3     1
4     2
5     2
6     0
7     0
8     1
9     1
10    2
11    2
dtype: int64

Now if you pass this as the index of the pivot_table, an arbitrary aggfunc (mean, min, max, first or last) should work for you as those index-column pairs have only one entry.

Upvotes: 3

Related Questions