Reputation: 49002
A common workflow that I have in pandas is getting data from some numerical function in "wide" form and turning it into a "long" form dataframe for plotting and statistical modeling.
What I mean by wide form is that there is variable information encoded in the columns. For instance, say I measured some value at each of 5 timepoints in 10 different subjects:
import numpy as np
import pandas as pd

wide_df = pd.DataFrame(np.random.randn(10, 5),
                       index=pd.Series(list("abcdefghij"), name="subject"),
                       columns=pd.Series(np.arange(5) * 2, name="timepoint"))
print(wide_df)
timepoint 0 2 4 6 8
subject
a -0.670881 0.959608 -0.480081 0.142092 1.697058
b 2.369493 -0.561081 -0.183635 -0.807523 -0.421347
c -0.908420 0.629171 0.196728 -0.907443 0.264352
d -0.390138 -1.821304 -1.994605 0.225164 0.187649
e -0.860542 -0.998323 -0.490968 -0.815570 -1.009524
f -0.917390 -0.120567 -0.893095 -0.359155 -0.204112
g 0.557500 -1.522631 -1.175746 0.705043 -0.366932
h -0.817043 2.204493 -0.305202 0.464969 0.280027
i -1.137253 0.350984 0.095577 0.468167 -0.058058
j -0.569986 2.438580 -0.514894 0.860504 1.397393
[10 rows x 5 columns]
The quickest way I know to wrangle this into a long-form dataframe is using stack and then reset_index:
long_df = wide_df.stack().reset_index()
print(long_df.head())
subject timepoint 0
0 a 0 -0.670881
1 a 2 0.959608
2 a 4 -0.480081
3 a 6 0.142092
4 a 8 1.697058
[5 rows x 3 columns]
The problem is that my "value" column is now named 0. I could do
long_series = wide_df.stack()
long_series.name = "value"
long_df = long_series.reset_index()
But that is more typing, requires naming an intermediate variable, and mixes method calls with attribute assignment in a way that really breaks up my flow.
Is there a way to do this in one line? I thought maybe df.stack would take a name argument, but it doesn't, and Series objects don't seem to have a set_name method that I can find.
I do know about pandas.melt, but it seems like overkill in this case of "pure" wide table data, and it drops the subject index, which is important. Is there another answer here?
Upvotes: 0
Views: 315
Reputation: 128948
There is a name argument to Series.reset_index for just this reason:
In [14]: wide_df.stack().reset_index(name='foo')
Out[14]:
subject timepoint foo
0 a 0 -0.179968
1 a 2 1.559283
2 a 4 1.020142
3 a 6 -0.899663
4 a 8 2.983990
5 b 0 0.586476
6 b 2 0.055108
7 b 4 1.834005
8 b 6 1.226371
9 b 8 0.953103
10 c 0 -0.919273
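For reference, here is a minimal self-contained sketch of the one-liner above. The seeded generator `rng` and the column name "value" are just illustrative choices, not part of the original question. It also shows `Series.rename` with a scalar, which in current pandas sets the Series name and so gives an equivalent chain:

```python
import numpy as np
import pandas as pd

# Illustrative data shaped like the question's example (seed chosen arbitrarily).
rng = np.random.default_rng(0)
wide_df = pd.DataFrame(
    rng.standard_normal((10, 5)),
    index=pd.Series(list("abcdefghij"), name="subject"),
    columns=pd.Series(np.arange(5) * 2, name="timepoint"),
)

# stack() yields a Series with a (subject, timepoint) MultiIndex;
# reset_index(name=...) names the value column while flattening the index.
long_df = wide_df.stack().reset_index(name="value")

# Alternative: name the Series first, then reset the index.
long_df_alt = wide_df.stack().rename("value").reset_index()

print(long_df.head())
```

Both chains avoid the intermediate variable and the attribute assignment from the question.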
You could also define a melt method on DataFrame yourself if you want (it would be a nice addition to DataFrame):
In [14]: def _melt(self, *args, **kwargs):
....: return pd.melt(self.reset_index(), *args, **kwargs)
....:
In [15]: pd.DataFrame.melt = _melt
In [19]: wide_df.melt('subject',value_name='foo')
Out[19]:
subject timepoint foo
0 a 0 0.374912
1 b 0 -0.016272
2 c 0 -0.510553
3 d 0 -1.532472
4 e 0 -0.115107
5 f 0 -0.101772
6 g 0 -0.020966
7 h 0 0.427469
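As a sketch of the same idea without monkeypatching: assuming a reasonably recent pandas release in which `DataFrame.melt` is available as a built-in method, you can move the index into a column with `reset_index` and then melt. The seeded data and the name "foo" are illustrative, and `var_name` is passed explicitly here rather than relying on the columns' name being picked up:

```python
import numpy as np
import pandas as pd

# Illustrative data shaped like the question's example (seed chosen arbitrarily).
rng = np.random.default_rng(0)
wide_df = pd.DataFrame(
    rng.standard_normal((10, 5)),
    index=pd.Series(list("abcdefghij"), name="subject"),
    columns=pd.Series(np.arange(5) * 2, name="timepoint"),
)

# reset_index() turns the "subject" index into an ordinary column so that
# melt can keep it as an identifier instead of dropping it.
long_df = wide_df.reset_index().melt(
    id_vars="subject", var_name="timepoint", value_name="foo"
)

print(long_df.head())
```

Note that the rows come out grouped by timepoint rather than by subject, matching the output shown above.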
Upvotes: 5