user14241

Reputation: 748

log-likelihood in statsmodels vs pymc

I am trying to work through my first maximum likelihood estimation in Python. One of the steps requires me to compute the likelihood of the model parameters. I found some sample data, which can be summarized here:

import numpy as np
import pandas as pd
life_test = pd.DataFrame(columns=['points', 'time'])
life_test['points'] = np.linspace(1,14,14)
life_test['time'] = np.concatenate((np.linspace(5,40,8), np.linspace(50,100,6)), axis=0)

If I run a simple model through statsmodels.api, I get a log-likelihood of -14.601 from results.summary().

import statsmodels.api as sm
endog=np.array(life_test['points'])
exog=np.array(life_test['time'])
exog = sm.add_constant(exog)
results = sm.OLS(endog, exog).fit()
results.summary()
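The same number is also available directly on the fitted results object:

results.llf  # -14.601...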

Looking at the source for OLS, this seems to be the basic calculation for the log-likelihood:

params = np.array(results.params)
nobs2 = results.nobs/2.0  # float division is critical here!
ssr = np.dot(np.transpose(endog - np.dot(exog, params)),
             endog - np.dot(exog, params))  # sum of squared residuals
-nobs2*np.log(2*np.pi) - nobs2*np.log(ssr/(2*nobs2)) - nobs2
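If I read this right, the formula is just the sum of normal log-densities at the fitted values, with the variance set to its MLE (SSR/nobs). A quick sanity check with scipy (assuming scipy is available) gives the same number:

from scipy.stats import norm
fitted = np.dot(exog, params)
resid = endog - fitted
sigma_mle = np.sqrt(np.mean(resid**2))  # MLE sigma, i.e. sqrt(SSR/nobs)
norm.logpdf(endog, loc=fitted, scale=sigma_mle).sum()  # ~ -14.601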

When I try to implement this with PyMC, I get different results. It may be some miscalculation of the loc and scale on my part.

import pymc.distributions as dist
mu = exog.mean()
sigma = exog.std()
dist.normal_like(exog, mu, 1/sigma**2)

Here I get a value of -135.29. I feel I must be miscalculating my scale and loc values, but it may be some other error in my implementation. Perhaps OLS is using some other likelihood besides the normal log-likelihood? I am pretty new to statsmodels, PyMC, and MLE in general. Does anyone know what I'm doing wrong here?

Upvotes: 1

Views: 3134

Answers (1)

Dalek

Reputation: 4318

You can compare the result from statsmodels with sklearn:

>>> from sklearn.linear_model import LinearRegression
>>> x = LinearRegression(fit_intercept=False).fit(exog, endog)
>>> x.coef_
array([ 1.45714286,  0.13428571])

which matches

>>> sm.OLS(endog, exog).fit().params
array([ 1.45714286,  0.13428571])

The results are consistent. On the other hand, it seems you computed the likelihood of fitting a Gaussian to the exog data, which is different from linear regression.
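In other words, your snippet evaluates something like the following, a single Gaussian fitted to the predictor values themselves (roughly, ignoring the constant column that add_constant appended), rather than a regression likelihood:

>>> import pymc as pm
>>> t = np.array(life_test['time'])
>>> pm.normal_like(t, t.mean(), 1/t.std()**2)  # likelihood of the times under one Gaussian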

In order to recreate the linear regression with PyMC, you need to do the following:

  • Define the model's free parameters with a set of priors.
  • Pass the input data through the model for different values of the free parameters.
  • Set up a Gaussian likelihood on the observed data.

So the implementation with PyMC is:

import numpy as np
import pandas as pd
import pymc as pm

life_test = pd.DataFrame(columns=['points', 'time'])
life_test['points'] = np.linspace(1,14,14)
life_test['time'] = np.concatenate((np.linspace(5,40,8), np.linspace(50,100,6)), axis=0)
endog = np.array(life_test['points'])
exog = np.array(life_test['time'])

# priors on the intercept, the slope and the noise precision (tau = 1/sigma**2)
alpha = pm.Normal('alpha', mu=0, tau=2)
beta = pm.Normal('beta', mu=0, tau=2)
tau = pm.Uniform('tau', lower=0, upper=1)

# linear model and Gaussian likelihood on the observed points
y_est = alpha + beta * exog
y_like = pm.Normal('y', mu=y_est, tau=tau, observed=True, value=endog)

model = dict(y_like=y_like, alpha=alpha, beta=beta, tau=tau)
S = pm.MCMC(model)
S.sample(iter=100000, burn=1000)
pm.Matplot.plot(S)
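Once the sampler finishes, you can pull the posterior means of the free parameters from the traces (trace is the standard PyMC 2 accessor; the exact numbers will vary from run to run, and the fairly tight priors above will pull them somewhat) and compare them with the OLS coefficients above:

>>> S.trace('alpha')[:].mean()  # posterior mean of the intercept
>>> S.trace('beta')[:].mean()   # posterior mean of the slope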

If you compute the log-likelihood with the following procedure, you will get a result close to the statsmodels one using the pm.normal_like function:

>>> exog_c = sm.add_constant(exog)  # design matrix with the intercept column
>>> results = sm.OLS(endog, exog_c).fit()
>>> y_est = results.params[0] + results.params[1] * exog
>>> pm.normal_like(endog, y_est, 1/np.sqrt(y_est.std()))
-19.348540432740464
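The remaining gap comes from the choice of precision: statsmodels' log-likelihood plugs in the MLE residual variance, so passing tau = 1/(residual variance) should reproduce the -14.601 reported by results.summary():

>>> resid = endog - y_est
>>> pm.normal_like(endog, y_est, 1/resid.var())  # tau = 1/(SSR/nobs); matches results.llf of about -14.601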

Upvotes: 3
