Reputation: 35321
I'm deeply embarrassed to ask the question in this post, because I'm sure it is completely elementary. And yet I've spent the last two hours poring over the pandas
docs, and trying everything under the sun, to no avail...
Consider, for example, a dataframe like this one:
A B C D x y z
0 a0 b0 c0 d0 0.007 0.710 0.990
1 a0 b0 c0 d1 0.283 0.040 1.027
2 a0 b0 c1 d0 0.017 0.688 2.840
3 a0 b0 c1 d1 0.167 0.132 2.471
4 a0 b1 c0 d0 0.041 0.851 1.078
5 a0 b1 c0 d1 0.235 1.027 1.027
6 a0 b1 c1 d0 0.037 0.934 2.282
7 a0 b1 c1 d1 0.023 1.049 2.826
8 a1 b0 c0 d0 0.912 0.425 1.055
9 a1 b0 c0 d1 0.329 0.932 0.836
10 a1 b0 c1 d0 0.481 0.681 0.997
11 a1 b0 c1 d1 0.782 0.595 2.294
12 a1 b1 c0 d0 0.264 0.918 0.857
13 a1 b1 c0 d1 0.053 1.001 0.920
14 a1 b1 c1 d0 1.161 1.090 1.470
15 a1 b1 c1 d1 0.130 0.992 2.121
Note that each combination of distinct values for the columns A
, B
, C
, and D
occurs exactly once in this dataframe. Hence, one can think of this subset of columns as the "key columns", and the remaining columns as the "value columns"1.
I want to produce a new frame from this one, by "unstacking" column C
. By this I mean that, for each distinct combination of values of the remaining "key columns" (A
, B
, and D
), the values of the C
column give rise to corresponding "value columns" in the new frame.
For example, if we disregard columns x
and y
for the moment, the unstacking operation I mentioned above would produce the dataframe:
A B D c0 c1
0 a0 b0 d0 0.990 2.840
1 a0 b0 d1 1.027 2.471
2 a0 b1 d0 1.078 2.282
3 a0 b1 d1 1.027 2.826
4 a1 b0 d0 1.055 0.997
5 a1 b0 d1 0.836 2.294
6 a1 b1 d0 0.857 1.470
7 a1 b1 d1 0.920 2.121
Here the values of the new columns c0
and c1
are obtained from the values (in the original dataframe) of column z
, for the corresponding unique combinations of values for columns A
, B
, and D
.
How can I do this?
It'd be nice to be able to do a similar unstacking operation with the original frame (i.e. the one with all value columns x
, y
, and z
), either using some form of hierarchical column organization in the new frame (i.e. where now columns x
, y
, and z
would each consist, in turn, of "subcolums" c0
and c1
), or perhaps some scheme for naming the new columns using composite names (e.g. in the new frame columns x
, y
, and z
would be replaced by columns columns x_c0
, x_c1
, y_c0
, y_c1
, z_c0
, and z_c1
, etc.), but this may be hoping for too much...
PS: Here's the data for the dataframe above, in TSV format:
A B C D x y z
a0 b0 c0 d0 0.007 0.71 0.99
a0 b0 c0 d1 0.283 0.04 1.027
a0 b0 c1 d0 0.017 0.688 2.84
a0 b0 c1 d1 0.167 0.132 2.471
a0 b1 c0 d0 0.041 0.851 1.078
a0 b1 c0 d1 0.235 1.027 1.027
a0 b1 c1 d0 0.037 0.934 2.282
a0 b1 c1 d1 0.023 1.049 2.826
a1 b0 c0 d0 0.912 0.425 1.055
a1 b0 c0 d1 0.329 0.932 0.836
a1 b0 c1 d0 0.481 0.681 0.997
a1 b0 c1 d1 0.782 0.595 2.294
a1 b1 c0 d0 0.264 0.918 0.857
a1 b1 c0 d1 0.053 1.001 0.92
a1 b1 c1 d0 1.161 1.09 1.47
a1 b1 c1 d1 0.13 0.992 2.121
1More explicitly, the dataframe is isomorphic to a dict
whose keys are the 4-tuples ('a0', 'b0', 'c0', 'd0'), ('a0', 'b0', 'c0', 'd1'), ..., ('a1', 'b1', 'c1', 'd1')
, and whose values are the 3-tuples (0.007, 0.710, 0.990), (0.283, 0.040, 1.027), ..., (0.130, 0.992, 2.121)
. One of the many things I tried was to find a way to index the dataframe by unique combinations of the "key columns". Same story: this should be "hello-world"-level stuff in pandas
, but I was not able to find how to do it in the documentation. Clearly my brain and the brain of whoever wrote the pandas
docs are perfectly orthogonal to each other...
Upvotes: 2
Views: 300
Reputation: 353369
It'd be nice to be able to do a similar unstacking operation with the original frame (i.e. the one with all value columns x, y, and z), either using some form of hierarchical column organization in the new frame (i.e. where now columns x, y, and z would each consist, in turn, of "subcolums" c0 and c1)
How about:
>>> df.pivot_table(values=["x","y","z"],rows=["A","B","D"], cols="C")
x y z
C c0 c1 c0 c1 c0 c1
A B D
a0 b0 d0 0.007 0.017 0.710 0.688 0.990 2.840
d1 0.283 0.167 0.040 0.132 1.027 2.471
b1 d0 0.041 0.037 0.851 0.934 1.078 2.282
d1 0.235 0.023 1.027 1.049 1.027 2.826
a1 b0 d0 0.912 0.481 0.425 0.681 1.055 0.997
d1 0.329 0.782 0.932 0.595 0.836 2.294
b1 d0 0.264 1.161 0.918 1.090 0.857 1.470
d1 0.053 0.130 1.001 0.992 0.920 2.121
See here, and note that the default aggregation function is mean
. If you know that you don't have more than one value going to the same cell, as here, it doesn't matter much, but I once got myself into trouble because I thought it was sum
.
Upvotes: 2