usag1r

Reputation: 124

Python linear regression model (Pandas, statsmodels) - Value error: endog exog matrices size mismatch

A friend of mine asked me about this linear regression code and I also couldn't solve it, so now it's my question as well.

Error we are getting: ValueError: endog and exog matrices are different sizes

When I remove "Tech" from ind_names then it works fine. This might be pointless but for the sake of eliminating syntax error possibilities I tried doing it.

Tech and Financial industry labels are not equally distributed in the DataFrame so maybe this is causing a size mismatch? But I couldn't debug any further so decided to ask you guys.

It'd be really nice to get some confirmation on the error and solution ideas. Please find code below.

#We have a portfolio constructed of 3 randomly generated factors (fac1, fac2, fac3).
#Python code provides the following message 
#ValueError: The indices for endog and exog are not aligned

import pandas as pd
from numpy.random import rand
import numpy as np
import statsmodels.api as sm

fac1, fac2, fac3 = np.random.rand(3, 1000) #Generate  random factors

#Consider a collection of hypothetical stock portfolios
#Generate randomly 1000 tickers
import random; random.seed(0)
import string
N = 1000
def rands(n):
  choices = string.ascii_uppercase
  return ''.join([random.choice(choices) for _ in range(n)])


tickers = np.array([rands(5) for _ in range(N)])
ticker_subset = tickers.take(np.random.permutation(N)[:1000])

#Weighted sum of factors plus noise

port = pd.Series(0.7 * fac1 - 1.2 * fac2 + 0.3 * fac3 + rand(1000), index=ticker_subset)
factors = pd.DataFrame({'f1': fac1, 'f2': fac2, 'f3': fac3}, index=ticker_subset)

#Correlations between each factor and the portfolio 
#print(factors.corrwith(port))
factors1=sm.add_constant(factors)


#Calculate factor exposures using a regression estimated by OLS
#print(sm.OLS(np.asarray(port), np.asarray(factors1)).fit().params)

#Calculate the exposure on each industry
def beta_exposure(chunk, factors=None):
    return sm.OLS(np.asarray(chunk), np.asarray(factors)).fit().params


#Assume that we have only two industries – financial and tech

ind_names = np.array(['Financial', 'Tech'])
#Create a random industry classification 

sampler = np.random.randint(0, len(ind_names), N)
industries = pd.Series(ind_names[sampler], index=tickers, name='industry')
by_ind = port.groupby(industries)



exposures=by_ind.apply(beta_exposure, factors=factors1)
print(exposures)
#exposures.unstack()

#Determine the exposures on each industry

Upvotes: 2

Views: 8830

Answers (1)

Zev

Reputation: 3491

Understanding the error message:

ValueError: endog and exog matrices are different sizes

Okay, not too bad. The endogenous matrix and the exogenous matrix are of different sizes. The statsmodels docs have a page on endog and exog (linked at the end of this answer) which explains that endogenous variables are the factors within the system and exogenous variables are factors outside it.

Some debugging

Check what shapes we are getting for our arrays. To do that, we need to take apart that one-liner and print the .shape of each argument, or maybe print the first handful of rows of each. Also, comment out the line throwing the error. Doing that, we discover that we get:

chunk [490]
factor [1000    4]
chunk [510]
factor [1000    4]

Oh! There it is. We were expecting factor to be chunked too. It should be [490 4] the first time and [510 4] the second time. Note: since the categories are assigned randomly, this will differ each time.
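A minimal sketch of that shape check, using only pandas and NumPy. The `T0000`-style tickers, the seeded generator, and the hand-added `const` column (standing in for `sm.add_constant`) are assumptions made for illustration, not part of the original code. The exact group sizes depend on the random draw.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
N = 1000

# Hypothetical unique tickers, used instead of the random 5-letter strings
tickers = np.array([f"T{i:04d}" for i in range(N)])

port = pd.Series(rng.random(N), index=tickers)
factors = pd.DataFrame(rng.random((N, 3)), index=tickers, columns=["f1", "f2", "f3"])
factors.insert(0, "const", 1.0)  # stand-in for sm.add_constant(factors)

industries = pd.Series(rng.choice(["Financial", "Tech"], size=N),
                       index=tickers, name="industry")

# Walk the groups by hand and record what a per-group function would receive
shapes = {}
for name, chunk in port.groupby(industries):
    shapes[name] = (chunk.shape, factors.shape)
    print(name, "chunk", chunk.shape, "factors", factors.shape)
```

Each chunk holds only its own industry's rows, while `factors` keeps all 1000 rows in every iteration.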

So basically we have just too much info in that function. We can use the chunk to see which factors to choose, filter the factors down to just those, and then everything will work.

Looking over the function definitions in the docs:

class statsmodels.regression.linear_model.OLS(endog, exog=None, missing='none', hasconst=None, **kwargs)

We are just passing two arguments and the rest are optional. Let's look at the two we are passing.

endog (array-like) – 1-d endogenous response variable. The dependent variable.

exog (array-like) – A nobs x k array where nobs is the number of observations and k is the number of regressors...

Ah, endog and exog again. endog is a 1-d array-like. So far so good, shape 490 works. exog nobs? Oh, it's the number of observations. So it's a 2-d array and, in this case, we need shape 490 by 4.

This specific issue:

beta_exposure should be:

def beta_exposure(chunk, factors=None):
    # Keep only the factor rows whose tickers appear in this industry's chunk
    factors = factors.loc[factors.index.isin(chunk.index)]
    return sm.OLS(np.asarray(chunk), np.asarray(factors)).fit().params

The issue is that you are applying beta_exposure to each group (the assignment is randomized, so let's say 490 elements for Financial and 510 for Tech), but factors=factors1 always passes all 1000 rows (the groupby code doesn't touch that).

See http://www.statsmodels.org/dev/generated/statsmodels.regression.linear_model.OLS.html and http://www.statsmodels.org/dev/endog_exog.html for the references I used researching this.

Upvotes: 6
