user14241

Reputation: 748

log-likelihood in statsmodels vs pymc

I am trying to work through my first maximum likelihood estimation in Python. One of the steps requires me to compute the likelihood of the model parameters. I found some sample data, which can be summarized here:

import numpy as np
import pandas as pd
life_test = pd.DataFrame(columns=['points', 'time'])
life_test['points'] = np.linspace(1,14,14)
life_test['time'] = np.concatenate((np.linspace(5,40,8), np.linspace(50,100,6)), axis=0)

If I run a simple model through statsmodels.api, I get a log-likelihood of -14.601 from results.summary().

import statsmodels.api as sm
endog=np.array(life_test['points'])
exog=np.array(life_test['time'])
exog = sm.add_constant(exog)
results = sm.OLS(endog, exog).fit()
results.summary()
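The same number is also available directly on the fitted results object:

results.llf  # -14.601...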

Looking at the source for OLS, this seems to be the basic calculation for the log-likelihood:

params = np.array(results.params)
nobs2 = results.nobs/2.0  # float division is critical here!
ssr = np.dot(np.transpose(endog - np.dot(exog, params)),
             endog - np.dot(exog, params))  # sum of squared residuals
-nobs2*np.log(2*np.pi) - nobs2*np.log(ssr/(2*nobs2)) - nobs2
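If I read this right, the formula is just the sum of normal log-densities at the fitted values, with the variance set to its MLE (SSR/nobs). A quick sanity check with scipy (assuming scipy is available) gives the same number:

from scipy.stats import norm
fitted = np.dot(exog, params)
resid = endog - fitted
sigma_mle = np.sqrt(np.mean(resid**2))  # MLE sigma, i.e. sqrt(SSR/nobs)
norm.logpdf(endog, loc=fitted, scale=sigma_mle).sum()  # ~ -14.601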

When I try to implement this with PyMC, I get different results. It may be some miscalculation of the loc and scale on my part.

import pymc.distributions as dist
mu = exog.mean()
sigma = exog.std()
dist.normal_like(exog, mu, 1/sigma**2)

Here I get a value of -135.29. I feel I must be miscalculating my scale and loc values, but it may be some other error in my implementation. Perhaps OLS is using some other likelihood besides the normal log-likelihood? I am pretty new to statsmodels, PyMC, and MLE in general. Does anyone know what I'm doing wrong here?

Upvotes: 1

Views: 3134

Answers (1)

Dalek

Reputation: 4318

You can compare the result from statsmodels with sklearn:

>>> from sklearn.linear_model import LinearRegression
>>> x = LinearRegression(fit_intercept=False).fit(exog, endog)
>>> x.coef_
array([ 1.45714286,  0.13428571])

which matches

>>> sm.OLS(endog, exog).fit().params
array([ 1.45714286,  0.13428571])

The results are consistent. On the other hand, it seems you computed the likelihood of fitting a Gaussian to the exog data, which is different from linear regression.
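In other words, your snippet evaluates something like the following, a single Gaussian fitted to the predictor values themselves (roughly, ignoring the constant column that add_constant appended), rather than a regression likelihood:

>>> import pymc as pm
>>> t = np.array(life_test['time'])
>>> pm.normal_like(t, t.mean(), 1/t.std()**2)  # likelihood of the times under one Gaussian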

In order to recreate the linear regression with PyMC, you need to do the following:

  • Define the model's free parameters with a set of priors.
  • Pass the input data through the model for different values of the free parameters.
  • Set up a Gaussian likelihood on the observed data.

So the implementation with PyMC is:

import numpy as np
import pandas as pd
import pymc as pm

life_test = pd.DataFrame(columns=['points', 'time'])
life_test['points'] = np.linspace(1,14,14)
life_test['time'] = np.concatenate((np.linspace(5,40,8), np.linspace(50,100,6)), axis=0)
endog = np.array(life_test['points'])
exog = np.array(life_test['time'])

# priors on the intercept, the slope and the noise precision (tau = 1/sigma**2)
alpha = pm.Normal('alpha', mu=0, tau=2)
beta = pm.Normal('beta', mu=0, tau=2)
tau = pm.Uniform('tau', lower=0, upper=1)

# linear model and Gaussian likelihood on the observed points
y_est = alpha + beta * exog
y_like = pm.Normal('y', mu=y_est, tau=tau, observed=True, value=endog)

model = dict(y_like=y_like, alpha=alpha, beta=beta, tau=tau)
S = pm.MCMC(model)
S.sample(iter=100000, burn=1000)
pm.Matplot.plot(S)
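Once the sampler finishes, you can pull the posterior means of the free parameters from the traces (trace is the standard PyMC 2 accessor; the exact numbers will vary from run to run, and the fairly tight priors above will pull them somewhat) and compare them with the OLS coefficients above:

>>> S.trace('alpha')[:].mean()  # posterior mean of the intercept
>>> S.trace('beta')[:].mean()   # posterior mean of the slope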

If you compute the log-likelihood with the following procedure, you will get a result close to the statsmodels one using the pm.normal_like function:

>>> exog_c = sm.add_constant(exog)  # design matrix with the intercept column
>>> results = sm.OLS(endog, exog_c).fit()
>>> y_est = results.params[0] + results.params[1] * exog
>>> pm.normal_like(endog, y_est, 1/np.sqrt(y_est.std()))
-19.348540432740464
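The remaining gap comes from the choice of precision: statsmodels' log-likelihood plugs in the MLE residual variance, so passing tau = 1/(residual variance) should reproduce the -14.601 reported by results.summary():

>>> resid = endog - y_est
>>> pm.normal_like(endog, y_est, 1/resid.var())  # tau = 1/(SSR/nobs); matches results.llf of about -14.601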

Upvotes: 3
