Naming Pandas Series while stacking from DataFrame

Question

A common workflow that I have in pandas is getting data from some numerical function in "wide" form and turning it into a "long" form dataframe for plotting and statistical modeling.

What I mean by wide form is that there is variable information encoded in the columns. For instance, say I measured some value at each of 5 timepoints in 10 different subjects:

import numpy as np
import pandas as pd

wide_df = pd.DataFrame(np.random.randn(10, 5),
                       index=pd.Series(list("abcdefghij"), name="subject"),
                       columns=pd.Series(np.arange(5) * 2, name="timepoint"))
print(wide_df)


timepoint         0         2         4         6         8
subject                                                    
a         -0.670881  0.959608 -0.480081  0.142092  1.697058
b          2.369493 -0.561081 -0.183635 -0.807523 -0.421347
c         -0.908420  0.629171  0.196728 -0.907443  0.264352
d         -0.390138 -1.821304 -1.994605  0.225164  0.187649
e         -0.860542 -0.998323 -0.490968 -0.815570 -1.009524
f         -0.917390 -0.120567 -0.893095 -0.359155 -0.204112
g          0.557500 -1.522631 -1.175746  0.705043 -0.366932
h         -0.817043  2.204493 -0.305202  0.464969  0.280027
i         -1.137253  0.350984  0.095577  0.468167 -0.058058
j         -0.569986  2.438580 -0.514894  0.860504  1.397393

[10 rows x 5 columns]

The quickest way I know how to wrangle this thing into a long form dataframe is using stack and then reset_index:

long_df = wide_df.stack().reset_index()
print(long_df.head())

  subject  timepoint         0
0       a          0 -0.670881
1       a          2  0.959608
2       a          4 -0.480081
3       a          6  0.142092
4       a          8  1.697058

[5 rows x 3 columns]

The problem is that my "value" column is now named 0. I could do

long_series = wide_df.stack()
long_series.name = "value"
long_df = long_series.reset_index()

But that is more typing, requires naming an intermediate variable, and mixes method calls with attribute assignment in a way that really breaks up my flow.

Is there a way to do this in one line? I thought maybe df.stack would take a name argument, but it doesn't, and Series objects don't seem to have a set_name method that I can find.

I do know about pandas.melt, but it seems like overkill in this case of "pure" wide table data, and it drops the subject index which is important. Is there another answer here?

Jeff · Accepted Answer

There is a name argument to Series.reset_index for just this reason:

In [14]: wide_df.stack().reset_index(name='foo')
Out[14]: 
   subject  timepoint       foo
0        a          0 -0.179968
1        a          2  1.559283
2        a          4  1.020142
3        a          6 -0.899663
4        a          8  2.983990
5        b          0  0.586476
6        b          2  0.055108
7        b          4  1.834005
8        b          6  1.226371
9        b          8  0.953103
10       c          0 -0.919273
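
For anyone who wants to try this outside the IPython session, here is a minimal self-contained sketch. The small 3x2 frame, the seed, and the column name "value" are my own choices for illustration, not part of the original data. It also shows Series.rename with a scalar as another way to set the name inside a method chain.

```python
import numpy as np
import pandas as pd

# Small stand-in for the wide table in the question (shape and seed are arbitrary).
rng = np.random.default_rng(0)
wide_df = pd.DataFrame(
    rng.standard_normal((3, 2)),
    index=pd.Index(list("abc"), name="subject"),
    columns=pd.Index([0, 2], name="timepoint"),
)

# One-liner from the answer: name the value column while resetting the index.
long_a = wide_df.stack().reset_index(name="value")

# Alternative chain: passing a scalar to Series.rename sets the Series name.
long_b = wide_df.stack().rename("value").reset_index()

print(long_a.columns.tolist())  # ['subject', 'timepoint', 'value']
```

Both chains should give the same long frame, so which one to use is mostly a matter of taste.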

You could also define a melt method yourself if you want to (it would be a nice addition to DataFrame):

In [14]: def _melt(self, *args, **kwargs):
   ....:     return pd.melt(self.reset_index(), *args, **kwargs)
   ....: 

In [15]: pd.DataFrame.melt = _melt

In [19]: wide_df.melt('subject',value_name='foo')
Out[19]: 
   subject  timepoint       foo
0        a          0  0.374912
1        b          0 -0.016272
2        c          0 -0.510553
3        d          0 -1.532472
4        e          0 -0.115107
5        f          0 -0.101772
6        g          0 -0.020966
7        h          0  0.427469
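
The same idea can be sketched without patching DataFrame, by calling pd.melt on the reset frame directly. The toy data and the name "value" below are assumptions for illustration only. Note that melt walks the table column by column while stack goes row by row, so the rows come out in a different order.

```python
import numpy as np
import pandas as pd

# Small stand-in for the wide table in the question (shape and seed are arbitrary).
rng = np.random.default_rng(0)
wide_df = pd.DataFrame(
    rng.standard_normal((3, 2)),
    index=pd.Index(list("abc"), name="subject"),
    columns=pd.Index([0, 2], name="timepoint"),
)

# reset_index first so that "subject" survives as an id column;
# the variable column picks up the columns' name, "timepoint".
melted = pd.melt(wide_df.reset_index(), id_vars="subject", value_name="value")

# The stack-based route from the first part of the answer, for comparison.
stacked = wide_df.stack().reset_index(name="value")

# Sort the melted rows into stack's row-major order before comparing.
melted_sorted = melted.sort_values(["subject", "timepoint"]).reset_index(drop=True)
print(np.allclose(melted_sorted["value"], stacked["value"]))
```

Newer pandas releases ship a DataFrame.melt method of their own, so assigning pd.DataFrame.melt as above would shadow it there; calling pd.melt as a plain function avoids that.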
