Thomas Philips

Reputation: 1089

Transpose a Pandas dataframe around a column with many repeated entries

I have created a data frame that has a dozen attributes (e.g. P/E, P/B) for a set of 300 stocks at the end of each month for 10 years. The dataframe is initially indexed by month (so there are 300 observations in each month) and then by stock ticker (so that there are 400 observations associated with each ticker). My dataframe has about 12 columns and 120,000 rows.

I now want to transpose the dataframe so that the columns are the stock tickers, i.e. to have about 300 columns, one per stock. I would like the date to continue to be associated with each row, so that I have about 4,800 rows associated with each ticker in a column (12 variables x 400 observations).
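For concreteness, a toy frame with the same layout might be built like this (the dates, tickers and attribute names are placeholders, not my actual data):

import numpy as np
import pandas as pd

# Toy stand-in for the real frame: a (Date, ticker) MultiIndex with one column per attribute
dates = ['2010-01-31', '2010-02-28', '2010-03-31']
tickers = ['AAA', 'BBB', 'CCC']
idx = pd.MultiIndex.from_product([dates, tickers], names=['Date', 'ticker'])
DF = pd.DataFrame(np.random.rand(len(idx), 2), index=idx, columns=['P/E', 'P/B'])

Here is what I tried: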

#Sort the dataframe, first by date and then by ticker, then write to Excel
DF.sort_values(by=['Date', 'ticker'], inplace=True)
DF.to_excel(outPath + outFn1)

#Now make ticker the index and sort by ticker and date 
DF = DF.reset_index().set_index('ticker')
DF.sort_values(by=['ticker', 'Date'], inplace=True)
DF.to_excel(outPath + outFn2)

#Now transpose the dataframe
DF = DF.reset_index().set_index('Date')
DF = DF.transpose()
DF.to_excel(outPath + outFn3)

Unfortunately, when I do the last DF.transpose(), I get 120,000 columns and only 12 rows: the 300 tickers and 400 months are turned into 120,000 columns, with the month (the index) displayed in the first row! Is there a way to have only 300 columns (one per ticker) and 4,800 rows (one per month per variable)?
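To illustrate on a fresh copy of the toy frame above (before any of the in-place sorting), the transpose produces one column per (date, ticker) row rather than one column per ticker:

toy = pd.DataFrame(np.random.rand(len(idx), 2), index=idx, columns=['P/E', 'P/B'])
wide = toy.reset_index().set_index('Date').transpose()
print(wide.shape)   # (3, 9): 3 rows (ticker, P/E, P/B) by 9 columns (3 dates x 3 tickers)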

Many thanks in advance for your assistance

Thomas Philips

Upvotes: 0

Views: 405

Answers (1)

Alex Fish

Reputation: 778

Assuming you have a DataFrame, data, that looks like this:

Feature              0         1    ...            6         7
Month Ticker                        ...                       
0     A       0.584049  0.701758    ...     0.724876  0.091812
      B       0.546022  0.554775    ...     0.381720  0.014467
      C       0.166254  0.810845    ...     0.848234  0.078745
      D       0.164278  0.409916    ...     0.472850  0.836136
      E       0.196801  0.556134    ...     0.451976  0.339668
1     A       0.336286  0.971672    ...     0.710570  0.344017
      B       0.449617  0.052413    ...     0.878496  0.404431
      C       0.383405  0.282608    ...     0.972533  0.957979
      D       0.328034  0.068577    ...     0.365409  0.158591
      E       0.846107  0.923831    ...     0.839615  0.890490
2     A       0.054455  0.427796    ...     0.528548  0.568229
      B       0.953295  0.567187    ...     0.035515  0.467428
      C       0.155837  0.810630    ...     0.512707  0.991404
      D       0.626261  0.268854    ...     0.919632  0.630014
      E       0.978789  0.578045    ...     0.155516  0.538323
3     A       0.641031  0.733028    ...     0.911317  0.210537
      B       0.341537  0.682774    ...     0.778101  0.017675
      C       0.574298  0.707269    ...     0.967428  0.319638
      D       0.577600  0.563917    ...     0.831466  0.053403
      E       0.732570  0.475240    ...     0.696523  0.417793

EDIT: I've just realized there is a much simpler way to do it.

data.reset_index().set_index('Month').groupby('Month').apply(lambda g: g.set_index('Ticker').T)

In terms of performance, both solutions are similar, although I was only testing on a small data set; I assume that on larger ones stack might hurt.
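For reference, here is a self-contained sketch that builds random sample data shaped like the frame above and applies the one-liner (the level names Month, Ticker and the column label Feature are taken from the printout; the values are placeholders):

import numpy as np
import pandas as pd

# Sample data: a (Month, Ticker) MultiIndex with 8 feature columns of random numbers
months = range(4)
tickers = list('ABCDE')
idx = pd.MultiIndex.from_product([months, tickers], names=['Month', 'Ticker'])
data = pd.DataFrame(np.random.rand(len(idx), 8), index=idx)
data.columns.name = 'Feature'

# One transposed block per month: the result is indexed by (Month, Feature)
# with one column per Ticker.
result = (data.reset_index()
              .set_index('Month')
              .groupby('Month')
              .apply(lambda g: g.set_index('Ticker').T))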

First, stack it and reindex:

stacked = data.stack().to_frame().reset_index().set_index(['Month', 'Feature'])

And now we are ready to pivot:

stacked.pivot(index=stacked.index, columns='Ticker')[0]

The result will look like this:

Ticker                A         B         C         D         E
Month Feature                                                  
0     0        0.584049  0.546022  0.166254  0.164278  0.196801
      1        0.701758  0.554775  0.810845  0.409916  0.556134
      2        0.746143  0.996657  0.843376  0.739536  0.518822
      3        0.191602  0.274246  0.207361  0.739565  0.980208
      4        0.173600  0.095523  0.184685  0.869971  0.004018
      5        0.891771  0.508821  0.178942  0.423447  0.804024
      6        0.724876  0.381720  0.848234  0.472850  0.451976
      7        0.091812  0.014467  0.078745  0.836136  0.339668
1     0        0.336286  0.449617  0.383405  0.328034  0.846107
      1        0.971672  0.052413  0.282608  0.068577  0.923831
      2        0.769480  0.651071  0.322606  0.811118  0.321575
      3        0.527988  0.709826  0.254327  0.099533  0.117720
      4        0.240174  0.768326  0.689877  0.587520  0.162302
      5        0.287952  0.636137  0.557708  0.212823  0.832591
      6        0.710570  0.878496  0.972533  0.365409  0.839615
      7        0.344017  0.404431  0.957979  0.158591  0.890490
2     0        0.054455  0.953295  0.155837  0.626261  0.978789
      1        0.427796  0.567187  0.810630  0.268854  0.578045
      2        0.938299  0.821334  0.500624  0.037103  0.753389
      3        0.671038  0.202686  0.958819  0.793337  0.598762
      4        0.458506  0.371354  0.751473  0.624051  0.157371
      5        0.467098  0.884003  0.747713  0.771846  0.484238
      6        0.528548  0.035515  0.512707  0.919632  0.155516
      7        0.568229  0.467428  0.991404  0.630014  0.538323
3     0        0.641031  0.341537  0.574298  0.577600  0.732570
      1        0.733028  0.682774  0.707269  0.563917  0.475240
      2        0.847406  0.920411  0.017330  0.308168  0.311288
      3        0.882707  0.252812  0.663594  0.742554  0.648782
      4        0.134776  0.539955  0.825222  0.227707  0.215726
      5        0.291174  0.111133  0.645823  0.676221  0.753326
      6        0.911317  0.778101  0.967428  0.831466  0.696523
      7        0.210537  0.017675  0.319638  0.053403  0.417793
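Note that depending on your pandas version, pivot may not accept an index object passed via index=; if so, the same (Month, Feature) by Ticker result can also be produced with unstack (an equivalent reshape, under the same assumptions about the sample frame as above):

# stack() moves the Feature columns into the row index, giving a (Month, Ticker, Feature) Series;
# unstack('Ticker') then pulls the Ticker level back out as columns.
result = data.stack().unstack('Ticker')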

Upvotes: 1
