Reputation: 608
I have a large data frame, df, containing 4 columns:
id period ret_1m mkt_ret_1m
131146 CAN00WG0 199609 -0.1538 0.047104
133530 CAN00WG0 199610 -0.0455 -0.014143
135913 CAN00WG0 199611 0.0000 0.040926
138334 CAN00WG0 199612 0.2952 0.008723
140794 CAN00WG0 199701 -0.0257 0.039916
143274 CAN00WG0 199702 -0.0038 -0.025442
145754 CAN00WG0 199703 -0.2992 -0.049279
148246 CAN00WG0 199704 -0.0919 -0.005948
150774 CAN00WG0 199705 0.0595 0.122322
153318 CAN00WG0 199706 -0.0337 0.045765
id period ret_1m mkt_ret_1m
160980 CAN00WH0 199709 0.0757 0.079293
163569 CAN00WH0 199710 -0.0741 -0.044000
166159 CAN00WH0 199711 0.1000 -0.014644
168782 CAN00WH0 199712 -0.0909 -0.007072
171399 CAN00WH0 199801 -0.0100 0.001381
174022 CAN00WH0 199802 0.1919 0.081924
176637 CAN00WH0 199803 0.0085 0.050415
179255 CAN00WH0 199804 -0.0168 0.018393
181880 CAN00WH0 199805 0.0427 -0.051279
184516 CAN00WH0 199806 -0.0656 -0.011516
id period ret_1m mkt_ret_1m
143275 CAN00WO0 199702 -0.1176 -0.025442
145755 CAN00WO0 199703 -0.0074 -0.049279
148247 CAN00WO0 199704 -0.0075 -0.005948
150775 CAN00WO0 199705 0.0451 0.122322
etc.
I am attempting to calculate a common financial measure, known as beta, using a function, that takes two of the columns, ret_1m, the monthly stock_return, and ret_1m_mkt, the market 1 month return for the same period (period_id). I want to apply a function (calc_beta) to calculate the 12-month result of this function on a 12 month rolling basis.
To do this, I am creating a groupby object:
grp = df.groupby('id')
What I would like to do is use something like:
period = 12
for stock, sub_df in grp:
arg = sub_df[['ret_1m', 'mkt_ret_1m']]
beta = pd.rolling_apply(arg, period, calc_beta, min_periods = period)
Now, here is the first problem. According to the documentation, pd.rolling_apply arg can be either a series or a data frame. However, it appears that the data frame I supply is converted into a numpy array that can only contain one column of data, rather than the two I have tried to supply. So my code below for calc_beta will not work, because I need to pass both the stock and market returns:
def calc_beta(np_array)
s = np_array[:,0] # stock returns are column zero from numpy array
m = np_array[:,1] # market returns are column one from numpy array
covariance = np.cov(s,m) # Calculate covariance between stock and market
beta = covariance[0,1]/covariance[1,1]
return beta
So my questions are as follows, I think it makes sense to list them in this way:
(i) How can I pass a data frame/multiple series/numpy array with more than one column to calc_beta using rolling_apply?
(ii) How can I return more than one value (e.g. the beta) from the calc_beta function?
(iii) Having calculated rolling quantities, how can I recombined with the original dataframe df so that I have the rolling quantities corresponding to the correct date in the period column?
(iv) Is there a better (vectorized) way of achieving this? I have seen some similar questions using e.g. df.apply(pd.rolling_apply,period,??) but I did not understand how these worked.
I gather that rolling_apply previously was unable to handle data frames, but the documentations suggests that it is now able to do so. My pandas.version is 0.16.1.
Thanks for any help! I have lost 1.5 days trying to figure this out and am totally stumped.
Ultimately, what I want is something like this:
id period ret_1m mkt_ret_1m beta other_quantities
131146 CAN00WG0 199609 -0.1538 0.047104 0.521 xxx
133530 CAN00WG0 199610 -0.0455 -0.014143 0.627 xxxx
135913 CAN00WG0 199611 0.0000 0.040926 0.341 xxx
138334 CAN00WG0 199612 0.2952 0.008723 0.567 xx
140794 CAN00WG0 199701 -0.0257 0.039916 0.4612 xxx
143274 CAN00WG0 199702 -0.0038 -0.025442 0.215 xxx
145754 CAN00WG0 199703 -0.2992 -0.049279 0.4678 xxx
148246 CAN00WG0 199704 -0.0919 -0.005948 -0.4225 xxx
150774 CAN00WG0 199705 0.0595 0.122322 0.780 xxx
153318 CAN00WG0 199706 -0.0337 0.045765 0.623 xxx
id period ret_1m mkt_ret_1m beta other_quantities
160980 CAN00WH0 199709 0.0757 0.079293 -0.913 xx
163569 CAN00WH0 199710 -0.0741 -0.044000 0.894 xxx
166159 CAN00WH0 199711 0.1000 -0.014644 0.563 xxx
168782 CAN00WH0 199712 -0.0909 -0.007072 0.734 xxx
171399 CAN00WH0 199801 -0.0100 0.001381 0.894 xxxx
174022 CAN00WH0 199802 0.1919 0.081924 0.789 xx
176637 CAN00WH0 199803 0.0085 0.050415 0.1563 xxxx
179255 CAN00WH0 199804 -0.0168 0.018393 -0.64 xxxx
181880 CAN00WH0 199805 0.0427 -0.051279 -0.742 xxx
184516 CAN00WH0 199806 -0.0656 -0.011516 0.925 xxx
id period ret_1m mkt_ret_1m beta
143275 CAN00WO0 199702 -0.1176 -0.025442 -1.52 xx
145755 CAN00WO0 199703 -0.0074 -0.049279 -0.632 xxx
148247 CAN00WO0 199704 -0.0075 -0.005948 1.521 xx
150775 CAN00WO0 199705 0.0451 0.122322 0.0321 xxx
etc.
Upvotes: 4
Views: 21289
Reputation: 161
Good News! In pandas 1.3.0, a new method "Table" is added to rolling.apply, everything solved!
Here is an example code.
def coefcalc(df):
Y=np.ascontiguousarray(df[:,0]) # rebuild a contigous numpy array for faster reg
X=np.ascontiguousarray(df[:,1:])
try:
b=(np.linalg.inv(X.T@X))@X.T@Y
return np.nan,b[0],b[1],b[2],b[3],b[4]
except:
return np.nan,np.nan,np.nan,np.nan,np.nan,np.nan
fama5=pd.read_csv('F-F_Research_Data_5_Factors_2x3.csv')
fama5=fama5.iloc[0:695,:].rename(columns={'Mkt-RF':'Mkt_Rf'})
fama5['date']=pd.to_datetime(fama5.date,format='%Y%m',errors='ignore')
for var in ['Mkt_Rf','SMB','HML','RMW','CMA','RF']:
fama5[var]=pd.to_numeric(fama5[var])/100
fama5[var]=fama5[var].astype('float64')
fama5=fama5.set_index('date')
fama5=fama5.drop('RF',axis=1)
beta=fama5.rolling('100d', method="table", min_periods=0).apply(coefcalc, raw=True, engine="numba")
You can download F-F_Research_Data_5_Factors_2x3.csv from Ken French's Homepage https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/ftp/F-F_Research_Data_5_Factors_2x3_CSV.zip
Upvotes: 1
Reputation: 161
def rolling_apply(df, period, func, min_periods=None):
if min_periods is None:
min_periods = period
result = pd.Series(np.nan, index=df.index)
for i in range(1, len(df)):
sub_df = df.iloc[max(i-period, 0):i,:] #get a subsample to run
if len(sub_df) >= min_periods:
idx = sub_df.index[-1]+1 # mind the forward looking bias,your return in time t should not be inclued in the beta calculating in time t
result[idx] = func(sub_df)
return result
I fix a forward looking bias for Happy001's code. It's a finance problem, so it should be cautious.
I find that vlmercado's answer is so wrong. If you simply use pd.rolling_cov and pd.rolling_var you are making mistakes in finance. Firstly, it's obvious that the second stock CAN00WH0 do not have any NaN beta, since it use the return of CAN00WG0, which is wrong at all. Secondly, consider such a situation: a stock suspended for ten years, and you can also get that sample into your beta calculating.
I find that pandas.rolling also works for Timestamp, you can see how in my answer above if interested. I change the code of Happy001's code . It's not the fastest way, but is at least 20x faster than the origin code.
crsp_daily['date']=pd.to_datetime(crsp_daily['date'])
crsp_daily=crsp_daily.set_index('date') # rolling needs a time serie index
crsp_daily.index=pd.DatetimeIndex(crsp_daily.index)
calc=crsp_daily[['permno','ret','mkt_ret']]
grp = calc.groupby('permno') #rolling beta for each stock
beta=pd.DataFrame()
for stock, sub_df in grp:
sub2_df=sub_df[['ret','mkt_ret']].sort_index()
beta_m = sub2_df.rolling('1825d',min_periods=150).cov() # 5yr rolling beta , note that d for day, and you cannot use w/m/y, s/d are availiable.
beta_m['beta']=beta_m['ret']/beta_m['mkt_ret']
beta_m=beta_m.xs('mkt_ret',level=1,axis=0)
beta=beta.append(pd.merge(sub_df,pd.DataFrame(beta_m['beta'])))
beta=beta.reset_index()
beta=beta[['date','permno','beta']]
Upvotes: 1
Reputation: 161
BTW, since nobody ask the mutil-variable rolling regression in python, I also find a way to slove this problem. The key is that first stack them into one column, and reshape the dataframe in function.
Here is the Code
import pandas as pd
import numpy as np
import timeit
from numba import jit
@jit(nopython=True, cache=True,fastmath=True) # numba only support numpy but pandas, so pd.DataFrame is forbiddened.
def coefcalc(df,coefpos,varnum):
# coefpos: which coef you need, for example I want alpha, so I would set the coefpos as 5 to obtain the alpha since the constant is in the last column in X. varnum: how many variables you put in df except "stkcd and date", in this sample, it's 7 (return,five fama factors, and a constant)
if np.mod(df.shape[0],varnum)==0:
df=df.reshape(df.shape[0]//varnum,varnum) # reshape the one column to n column for reg.
Y=np.ascontiguousarray(df[:,0]) # rebuild a contigous numpy array for faster reg
X=np.ascontiguousarray(df[:,1:])
try:
b=(np.linalg.inv(X.T@X))@X.T@Y
result=b[coefpos]
except:
result=np.nan
return result
else:
return np.nan
calc2=pd.read_csv(r'sample.csv')
# A way for rolling beta/alpha
calc2=calc2.set_index(['date','F_INFO_WINDCODE'])
calc2=calc2.dropna() # regression will drop Nan automatically
calc2=calc2.stack().reset_index().set_index('date') # put n columns into one columns, and let datetime64 variable (date) to be the index.
localtime = time.asctime( time.localtime(time.time()) )
print(localtime)
order_of_variable=5 # expect for y (return), start from zero.
total_number_variable=7 # stkcd/date/return/fama5/constant
required_sample=30*total_number_variable # Monthly data
# the parallel kwarg may require loops in def so I turn it off.
alphaest=calc2.groupby('F_INFO_WINDCODE').rolling('1095d',min_periods=required_sample)[0].apply(lambda x:coefcalc(x,5,7),engine='numba', raw=True,engine_kwargs={'nopython': True, 'nogil': True, 'parallel': False})
# as the pandas's document numba engine is faster than cpython when obs is over 1 million.
# you can check in https://pandas.pydata.org/pandas-docs/stable/user_guide/window.html , the numba engine part.
localtime = time.asctime( time.localtime(time.time()) )
print(localtime)
Upvotes: 1
Reputation: 1888
Try pd.rolling_cov() and pd.rolling.var() as follows:
import pandas as pd
import numpy as np
from StringIO import StringIO
df = pd.read_csv(StringIO(''' id period ret_1m mkt_ret_1m
131146 CAN00WG0 199609 -0.1538 0.047104
133530 CAN00WG0 199610 -0.0455 -0.014143
135913 CAN00WG0 199611 0.0000 0.040926
138334 CAN00WG0 199612 0.2952 0.008723
140794 CAN00WG0 199701 -0.0257 0.039916
143274 CAN00WG0 199702 -0.0038 -0.025442
145754 CAN00WG0 199703 -0.2992 -0.049279
148246 CAN00WG0 199704 -0.0919 -0.005948
150774 CAN00WG0 199705 0.0595 0.122322
153318 CAN00WG0 199706 -0.0337 0.045765
160980 CAN00WH0 199709 0.0757 0.079293
163569 CAN00WH0 199710 -0.0741 -0.044000
166159 CAN00WH0 199711 0.1000 -0.014644
168782 CAN00WH0 199712 -0.0909 -0.007072
171399 CAN00WH0 199801 -0.0100 0.001381
174022 CAN00WH0 199802 0.1919 0.081924
176637 CAN00WH0 199803 0.0085 0.050415
179255 CAN00WH0 199804 -0.0168 0.018393
181880 CAN00WH0 199805 0.0427 -0.051279
184516 CAN00WH0 199806 -0.0656 -0.011516
143275 CAN00WO0 199702 -0.1176 -0.025442
145755 CAN00WO0 199703 -0.0074 -0.049279
148247 CAN00WO0 199704 -0.0075 -0.005948
150775 CAN00WO0 199705 0.0451 0.122322'''), sep='\s+')
df['beta'] = pd.rolling_cov(df['ret_1m'], df['mkt_ret_1m'], window=6) / pd.rolling_var(df['mkt_ret_1m'], window=6)
print df
Output:
id period ret_1m mkt_ret_1m beta
131146 CAN00WG0 199609 -0.1538 0.047104 NaN
133530 CAN00WG0 199610 -0.0455 -0.014143 NaN
135913 CAN00WG0 199611 0.0000 0.040926 NaN
138334 CAN00WG0 199612 0.2952 0.008723 NaN
140794 CAN00WG0 199701 -0.0257 0.039916 NaN
143274 CAN00WG0 199702 -0.0038 -0.025442 -1.245908
145754 CAN00WG0 199703 -0.2992 -0.049279 2.574464
148246 CAN00WG0 199704 -0.0919 -0.005948 2.657887
150774 CAN00WG0 199705 0.0595 0.122322 1.371090
153318 CAN00WG0 199706 -0.0337 0.045765 1.494095
160980 CAN00WH0 199709 0.0757 0.079293 1.616520
163569 CAN00WH0 199710 -0.0741 -0.044000 1.630411
166159 CAN00WH0 199711 0.1000 -0.014644 0.651220
168782 CAN00WH0 199712 -0.0909 -0.007072 0.652148
171399 CAN00WH0 199801 -0.0100 0.001381 0.724120
174022 CAN00WH0 199802 0.1919 0.081924 1.542782
176637 CAN00WH0 199803 0.0085 0.050415 1.605407
179255 CAN00WH0 199804 -0.0168 0.018393 1.571015
181880 CAN00WH0 199805 0.0427 -0.051279 1.139972
184516 CAN00WH0 199806 -0.0656 -0.011516 1.101890
143275 CAN00WO0 199702 -0.1176 -0.025442 1.372437
145755 CAN00WO0 199703 -0.0074 -0.049279 0.031939
148247 CAN00WO0 199704 -0.0075 -0.005948 -0.535855
150775 CAN00WO0 199705 0.0451 0.122322 0.341747
Upvotes: 9
Reputation: 6383
I guess pd.rolling_apply
doesn't help in this case since it seems to me that it essentially only takes a Series
(Even if a dataframe is passed, it's processing one column a time). But you can always write your own rolling_apply that takes a dataframe.
import pandas as pd
import numpy as np
from StringIO import StringIO
df = pd.read_csv(StringIO(''' id period ret_1m mkt_ret_1m
131146 CAN00WG0 199609 -0.1538 0.047104
133530 CAN00WG0 199610 -0.0455 -0.014143
135913 CAN00WG0 199611 0.0000 0.040926
138334 CAN00WG0 199612 0.2952 0.008723
140794 CAN00WG0 199701 -0.0257 0.039916
143274 CAN00WG0 199702 -0.0038 -0.025442
145754 CAN00WG0 199703 -0.2992 -0.049279
148246 CAN00WG0 199704 -0.0919 -0.005948
150774 CAN00WG0 199705 0.0595 0.122322
153318 CAN00WG0 199706 -0.0337 0.045765
160980 CAN00WH0 199709 0.0757 0.079293
163569 CAN00WH0 199710 -0.0741 -0.044000
166159 CAN00WH0 199711 0.1000 -0.014644
168782 CAN00WH0 199712 -0.0909 -0.007072
171399 CAN00WH0 199801 -0.0100 0.001381
174022 CAN00WH0 199802 0.1919 0.081924
176637 CAN00WH0 199803 0.0085 0.050415
179255 CAN00WH0 199804 -0.0168 0.018393
181880 CAN00WH0 199805 0.0427 -0.051279
184516 CAN00WH0 199806 -0.0656 -0.011516
143275 CAN00WO0 199702 -0.1176 -0.025442
145755 CAN00WO0 199703 -0.0074 -0.049279
148247 CAN00WO0 199704 -0.0075 -0.005948
150775 CAN00WO0 199705 0.0451 0.122322'''), sep='\s+')
def calc_beta(df):
np_array = df.values
s = np_array[:,0] # stock returns are column zero from numpy array
m = np_array[:,1] # market returns are column one from numpy array
covariance = np.cov(s,m) # Calculate covariance between stock and market
beta = covariance[0,1]/covariance[1,1]
return beta
def rolling_apply(df, period, func, min_periods=None):
if min_periods is None:
min_periods = period
result = pd.Series(np.nan, index=df.index)
for i in range(1, len(df)+1):
sub_df = df.iloc[max(i-period, 0):i,:] #I edited here
if len(sub_df) >= min_periods:
idx = sub_df.index[-1]
result[idx] = func(sub_df)
return result
df['beta'] = np.nan
grp = df.groupby('id')
period = 6 #I'm using 6 to see some not NaN values, since sample data don't have longer than 12 groups
for stock, sub_df in grp:
beta = rolling_apply(sub_df[['ret_1m','mkt_ret_1m']], period, calc_beta, min_periods = period)
beta.name = 'beta'
df.update(beta)
print df
Output
id period ret_1m mkt_ret_1m beta
131146 CAN00WG0 199609 -0.1538 0.047104 NaN
133530 CAN00WG0 199610 -0.0455 -0.014143 NaN
135913 CAN00WG0 199611 0.0000 0.040926 NaN
138334 CAN00WG0 199612 0.2952 0.008723 NaN
140794 CAN00WG0 199701 -0.0257 0.039916 NaN
143274 CAN00WG0 199702 -0.0038 -0.025442 -1.245908
145754 CAN00WG0 199703 -0.2992 -0.049279 2.574464
148246 CAN00WG0 199704 -0.0919 -0.005948 2.657887
150774 CAN00WG0 199705 0.0595 0.122322 1.371090
153318 CAN00WG0 199706 -0.0337 0.045765 1.494095
... ... ... ... ... ...
171399 CAN00WH0 199801 -0.0100 0.001381 NaN
174022 CAN00WH0 199802 0.1919 0.081924 1.542782
176637 CAN00WH0 199803 0.0085 0.050415 1.605407
179255 CAN00WH0 199804 -0.0168 0.018393 1.571015
181880 CAN00WH0 199805 0.0427 -0.051279 1.139972
184516 CAN00WH0 199806 -0.0656 -0.011516 1.101890
143275 CAN00WO0 199702 -0.1176 -0.025442 NaN
145755 CAN00WO0 199703 -0.0074 -0.049279 NaN
148247 CAN00WO0 199704 -0.0075 -0.005948 NaN
150775 CAN00WO0 199705 0.0451 0.122322 NaN
Upvotes: 5