Reputation: 124
A friend of mine asked me about this linear regression code and I also couldn't solve it, so now it's my question as well.
Error we are getting: ValueError: endog and exog matrices are different sizes
When I remove "Tech" from ind_names then it works fine. This might be pointless but for the sake of eliminating syntax error possibilities I tried doing it.
Tech and Financial industry labels are not equally distributed in the DataFrame so maybe this is causing a size mismatch? But I couldn't debug any further so decided to ask you guys.
It'd be really nice to get some confirmation on the error and solution ideas. Please find code below.
#We have a portfolio constructed of 3 randomly generated factors (fac1, fac2, fac3).
#Python code provides the following message
#ValueError: The indices for endog and exog are not aligned
import pandas as pd
from numpy.random import rand
import numpy as np
import statsmodels.api as sm
fac1, fac2, fac3 = np.random.rand(3, 1000) #Generate random factors
#Consider a collection of hypothetical stock portfolios
#Generate randomly 1000 tickers
import random; random.seed(0)
import string
N = 1000
def rands(n):
    choices = string.ascii_uppercase
    return ''.join([random.choice(choices) for _ in range(n)])
tickers = np.array([rands(5) for _ in range(N)])
ticker_subset = tickers.take(np.random.permutation(N)[:1000])
#Weighted sum of factors plus noise
port = pd.Series(0.7 * fac1 - 1.2 * fac2 + 0.3 * fac3 + rand(1000), index=ticker_subset)
factors = pd.DataFrame({'f1': fac1, 'f2': fac2, 'f3': fac3}, index=ticker_subset)
#Correlations between each factor and the portfolio
#print(factors.corrwith(port))
factors1=sm.add_constant(factors)
#Calculate factor exposures using a regression estimated by OLS
#print(sm.OLS(np.asarray(port), np.asarray(factors1)).fit().params)
#Calculate the exposure on each industry
def beta_exposure(chunk, factors=None):
    return sm.OLS(np.asarray(chunk), np.asarray(factors)).fit().params
#Assume that we have only two industries – financial and tech
ind_names = np.array(['Financial', 'Tech'])
#Create a random industry classification
sampler = np.random.randint(0, len(ind_names), N)
industries = pd.Series(ind_names[sampler], index=tickers, name='industry')
by_ind = port.groupby(industries)
exposures=by_ind.apply(beta_exposure, factors=factors1)
print(exposures)
#exposures.unstack()
#Determine the exposures on each industry
Upvotes: 2
Views: 8830
Reputation: 3491
ValueError: endog and exog matrices are different sizes
Okay, not too bad. The endogenous matrix and the exogenous matrix are of different sizes. The statsmodels documentation has a page on this terminology (linked at the end): endogenous means caused by factors within the system, and exogenous means caused by factors outside it.
Check what shapes we are getting for our arrays. To do that, take apart that one-liner and print the .shape of each argument, or print the first handful of rows of each. Also comment out the line throwing the error. Doing so, we discover that we get:
chunk [490]
factor [1000 4]
chunk [510]
factor [1000 4]
Oh! There it is. We were expecting factor to be chunked too. It should be [490 4] the first time and [510 4] the second time. Note: since the categories are assigned randomly, this will differ each time.
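One way to take the one-liner apart is to loop over the groups by hand and print the shapes, without fitting anything. This is only a sketch: the tickers of the form T0000 and the manually added const column are assumptions made to keep it self-contained, not the question's exact data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
N = 1000
tickers = [f"T{i:04d}" for i in range(N)]  # hypothetical unique tickers

port = pd.Series(rng.random(N), index=tickers)
factors = pd.DataFrame(rng.random((N, 3)), index=tickers,
                       columns=["f1", "f2", "f3"])
factors.insert(0, "const", 1.0)  # stands in for sm.add_constant(factors)
industries = pd.Series(rng.choice(["Financial", "Tech"], size=N),
                       index=tickers, name="industry")

# Iterate over the groups instead of calling apply, and record what
# beta_exposure would have received for each industry.
shapes = {}
for name, chunk in port.groupby(industries):
    shapes[name] = (chunk.shape, factors.shape)
    print(name, "chunk", chunk.shape, "factors", factors.shape)
```

The chunk length changes per industry while factors keeps all 1000 rows, which is the mismatch OLS complains about.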
So basically we are passing too much information into that function. We can use the chunk to see which factors to choose, filter the factors down to just those rows, and then everything will work.
class statsmodels.regression.linear_model.OLS(endog, exog=None, missing='none', hasconst=None, **kwargs)
We are just passing two arguments and the rest are optional. Let's look at the two we are passing.
endog (array-like) – 1-d endogenous response variable. The dependent variable.
exog (array-like) – A nobs x k array where nobs is the number of observations and k is the number of regressors...
Ah, endog and exog again. endog is 1-d array-like. So far so good, shape 490 works. exog is nobs x k. nobs? Oh, it's the number of observations. So it's a 2-d array and in this case we need shape 490 by 4.
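The shape requirement above can be written down as a small check to run before fitting. check_ols_shapes is a hypothetical helper for illustration only, not part of statsmodels, and it only mirrors the shapes quoted from the documentation (1-d endog, nobs x k exog).

```python
import numpy as np

def check_ols_shapes(endog, exog):
    """Hypothetical helper: fail early if endog and exog cannot be paired."""
    endog = np.asarray(endog)
    exog = np.asarray(exog)
    if endog.ndim != 1:
        raise ValueError(f"endog should be 1-d, got shape {endog.shape}")
    if exog.ndim != 2:
        raise ValueError(f"exog should be 2-d (nobs x k), got shape {exog.shape}")
    if endog.shape[0] != exog.shape[0]:
        raise ValueError(
            f"endog has {endog.shape[0]} observations, exog has {exog.shape[0]}"
        )
    return endog.shape[0], exog.shape[1]  # (nobs, k)

# Matching sizes pass and report nobs and k.
nobs, k = check_ols_shapes(np.zeros(490), np.zeros((490, 4)))
print(nobs, k)

# The situation from the question: a 490-row chunk against all 1000 factor rows.
try:
    check_ols_shapes(np.zeros(490), np.zeros((1000, 4)))
    mismatch_caught = False
except ValueError as err:
    mismatch_caught = True
    print("mismatch:", err)
```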
beta_exposure should be:
def beta_exposure(chunk, factors=None):
    factors = factors.loc[factors.index.isin(chunk.index)]
    return sm.OLS(np.asarray(chunk), np.asarray(factors)).fit().params
The issue is that you are applying beta_exposure to each group of the portfolio (the split is randomized, so let's say 490 elements for Financial and 510 for Tech), but factors=factors1 always gives you all 1000 rows (the groupby code doesn't touch that).
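Here is a self-contained sketch of the filtered version that can be run without statsmodels. Two things are swapped in and are assumptions, not the original code: np.linalg.lstsq stands in for sm.OLS(...).fit().params, and a manually inserted const column stands in for sm.add_constant. The tickers of the form T0000 are also hypothetical, chosen so the index is guaranteed to be unique.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
N = 1000
tickers = [f"T{i:04d}" for i in range(N)]  # hypothetical unique tickers

fac = rng.random((N, 3))
factors = pd.DataFrame(fac, index=tickers, columns=["f1", "f2", "f3"])
factors.insert(0, "const", 1.0)  # intercept column, in place of sm.add_constant

# Weighted sum of factors plus noise, as in the question.
port = pd.Series(0.7 * fac[:, 0] - 1.2 * fac[:, 1] + 0.3 * fac[:, 2]
                 + rng.random(N), index=tickers)
industries = pd.Series(rng.choice(["Financial", "Tech"], size=N),
                       index=tickers, name="industry")

def beta_exposure(chunk, factors=None):
    # Keep only the factor rows whose tickers are in this group.
    sub = factors.loc[factors.index.isin(chunk.index)]
    # Rows must line up one-to-one with the chunk before fitting.
    assert sub.index.equals(chunk.index)
    # Plain least squares as a stand-in for sm.OLS(...).fit().params.
    params, *_ = np.linalg.lstsq(sub.to_numpy(), chunk.to_numpy(), rcond=None)
    return pd.Series(params, index=sub.columns)

exposures = port.groupby(industries).apply(beta_exposure, factors=factors)
print(exposures)
```

The isin filter relies on factors and port sharing the same ticker order and on the tickers being unique; the assert inside beta_exposure makes that assumption explicit.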
See http://www.statsmodels.org/dev/generated/statsmodels.regression.linear_model.OLS.html and http://www.statsmodels.org/dev/endog_exog.html for the references I used researching this.
Upvotes: 6