Ishan Sharma
Ishan Sharma

Reputation: 6585

How to split a column into multiple columns in pandas?

I have this data in a pandas dataframe,

              name       date    close   quantity  daily_cumm_returns
0         AARTIIND 2000-01-03     3.84   21885.82            0.000000
1         AARTIIND 2000-01-04     3.60   56645.64           -0.062500
2         AARTIIND 2000-01-05     3.52   24460.62           -0.083333
3         AARTIIND 2000-01-06     3.58   42484.24           -0.067708
4         AARTIIND 2000-01-07     3.42   16736.21           -0.109375
5         AARTIIND 2000-01-10     3.42   20598.42           -0.109375
6         AARTIIND 2000-01-11     3.41   20598.42           -0.111979
7         AARTIIND 2000-01-12     3.27  100417.29           -0.148438
8         AARTIIND 2000-01-13     3.43   20598.42           -0.106771
9         AARTIIND 2000-01-14     3.60    5149.61           -0.062500
10        AARTIIND 2000-01-17     3.46   14161.42           -0.098958
11        AARTIIND 2000-01-18     3.50  136464.53           -0.088542
12        AARTIIND 2000-01-19     3.52   21885.82           -0.083333
13        AARTIIND 2000-01-20     3.73   75956.66           -0.028646
14        AARTIIND 2000-01-21     3.84   77244.07            0.000000
15        AARTIIND 2000-02-01     4.21   90118.08            0.000000
16        AARTIIND 2000-02-02     4.52  238169.21            0.073634
17        AARTIIND 2000-02-03     4.38  163499.94            0.040380
18        AARTIIND 2000-02-04     4.44  108141.71            0.054632
19        AARTIIND 2000-02-07     4.26   68232.27            0.011876
20        AARTIIND 2000-02-08     4.00  108141.71           -0.049881
21        AARTIIND 2000-02-09     3.96   32185.04           -0.059382
22        AARTIIND 2000-02-10     4.13   43771.63           -0.019002
23        AARTIIND 2000-02-11     3.96    3862.20           -0.059382
24        AARTIIND 2000-02-14     3.94   12874.01           -0.064133
25        AARTIIND 2000-02-15     3.90   33472.42           -0.073634
26        AARTIIND 2000-02-16     3.90   25748.02           -0.073634
27        AARTIIND 2000-02-17     3.90   60507.86           -0.073634
28        AARTIIND 2000-02-18     4.22   45059.04            0.002375
29        AARTIIND 2000-02-21     4.42   81106.27            0.049881

I wish to select every months data and transpose that into a new row, for e.g. the first 15 rows should become one row with name AARTIIND, date 2000-01-03 and then 15 columns having daily cummulative returns.

   name       date     first second  third  fourth  fifth .... fifteenth
0 AARTIIND 2000-01-03  0.00  -0.062  -0.083 -0.067 -0.109 .... 0.00

To group the data month wise I am using,

group = df.groupby([pd.Grouper(freq='1M', key='date'), 'name'])

Setting the rows individually by using the code below is very slow and my dataset has 1 million rows

data = pd.DataFrame(columns = ('name', 'date', 'daily_zscore_1', 'daily_zscore_2', 'daily_zscore_3', 'daily_zscore_4', 'daily_zscore_5', 'daily_zscore_6', 'daily_zscore_7', 'daily_zscore_8', 'daily_zscore_9', 'daily_zscore_10', 'daily_zscore_11', 'daily_zscore_12', 'daily_zscore_13', 'daily_zscore_14', 'daily_zscore_15'))
data.loc[0] = [x['name'].iloc[0], x['date'].iloc[0]].extend(x['daily_cumm_returns'])

Is there any other faster way to accomplish this, as I see it this is just transposing one column and hence should be very fast. I tried pivot and melt but don't understand how to use them in this situation.

Upvotes: 1

Views: 2200

Answers (2)

Ted Petrou
Ted Petrou

Reputation: 62037

This is a bit sloppy but it gets the job done.

# grab AAPL data
from pandas_datareader import data
df = data.DataReader('AAPL', 'google', start='2014-01-01')[['Close', 'Volume']]

# add name column
df['name'] = 'AAPL'

# get daily return relative to first of month
df['daily_cumm_return'] = df.resample('M')['Close'].transform(lambda x: (x - x[0]) / x[0])

# get the first of the month for each date
df['first_month_date'] = df.assign(index_col=df.index).resample('M')['index_col'].transform('first')

# get a ranking of the days 1 to n
df['day_rank']= df.resample('M')['first_month_date'].rank(method='first')

# pivot to get final
df_final = df.pivot_table(index=['name', 'first_month_date'], columns='day_rank', values='daily_cumm_return')

Sample Output

day_rank               1.0       2.0       3.0       4.0       5.0       6.0   \
name first_month_date                                                           
AAPL 2014-01-02         0.0 -0.022020 -0.016705 -0.023665 -0.017464 -0.029992   
     2014-02-03         0.0  0.014375  0.022052  0.021912  0.036148  0.054710   
     2014-03-03         0.0  0.006632  0.008754  0.005704  0.005173  0.006102   
     2014-04-01         0.0  0.001680 -0.005299 -0.018222 -0.033600 -0.033600   
     2014-05-01         0.0  0.001775  0.015976  0.004970  0.001420 -0.005917   
     2014-06-02         0.0  0.014141  0.025721  0.029729  0.026834  0.043314   
     2014-07-01         0.0 -0.000428  0.005453  0.026198  0.019568  0.019996

day_rank                   7.0       8.0       9.0       10.0      11.0  \
name first_month_date                                                     
AAPL 2014-01-02       -0.036573 -0.031511 -0.012149  0.007593  0.002025   
     2014-02-03        0.068667  0.068528  0.085555  0.084578  0.088625   
     2014-03-03        0.015785  0.016846  0.005571 -0.005704 -0.001857   
     2014-04-01       -0.020936 -0.033600 -0.040708 -0.036831 -0.043810   
     2014-05-01       -0.010059  0.002249  0.003787  0.004024 -0.004497   
     2014-06-02        0.049438  0.045095  0.027614  0.016368  0.026612   
     2014-07-01        0.016253  0.018178  0.031330  0.019247  0.013473  

day_rank                   12.0      13.0      14.0      15.0      16.0  \
name first_month_date                                                     
AAPL 2014-01-02       -0.022526 -0.007340 -0.002911  0.005442 -0.012782   
     2014-02-03        0.071458  0.059037  0.047313  0.051779  0.040893   
     2014-03-03        0.006897  0.006632  0.001857  0.009683  0.021754   
     2014-04-01       -0.041871 -0.030887 -0.019385 -0.018351 -0.031274   
     2014-05-01        0.010178  0.022130  0.022367  0.025089  0.026627   
     2014-06-02        0.025276  0.026389  0.022826  0.012248  0.011357   
     2014-07-01       -0.004598  0.009731  0.004491  0.012831  0.039243 

day_rank                   17.0      18.0      19.0      20.0      21.0  \
name first_month_date                                                     
AAPL 2014-01-02       -0.004809 -0.084282 -0.094660 -0.096431 -0.095039   
     2014-02-03        0.031542  0.052059  0.049267       NaN       NaN   
     2014-03-03        0.032763  0.022815  0.018437  0.017244  0.017111   
     2014-04-01        0.048204  0.055958  0.096795  0.093564  0.089429   
     2014-05-01        0.038225  0.057751  0.054911  0.074201  0.070178   
     2014-06-02        0.005233  0.006124  0.012137  0.024162  0.034740   
     2014-07-01        0.037532  0.044376  0.058811  0.051967  0.049508 

day_rank                   22.0      23.0  
name first_month_date                      
AAPL 2014-01-02             NaN       NaN  
     2014-02-03             NaN       NaN  
     2014-03-03             NaN       NaN  
     2014-04-01             NaN       NaN  
     2014-05-01             NaN       NaN  
     2014-06-02             NaN       NaN  
     2014-07-01        0.022241       NaN  

Upvotes: 1

dagrha
dagrha

Reputation: 2569

Admittedly this does not get exactly what you want...

I think one way to handle this problem would be to create new columns of month and day based on the datetime (date) column, then set a multiindex on the month and name, then pivot the table.

df['month'] = df.date.dt.month
df['day'] = df.date.dt.day
df.set_index(['month', 'name'], inplace=True)
df[['day', 'daily_cumm_returns']].pivot(index=df.index, columns='day')

Result is:

               daily_cumm_returns                                         \
      day                       1         2        3         4         5    
month name                                                                 
1     AARTIIND                NaN       NaN  0.00000 -0.062500 -0.083333   
2     AARTIIND                0.0  0.073634  0.04038  0.054632       NaN   

I can't figure out a way to keep the first date of each month group as a column, otherwise I think this is more or less what you're after.

Upvotes: 0

Related Questions