Coding a Gaussian log-likelihood in R

Question

I am trying to learn R by coding a Gaussian log-likelihood to solve with optim(), but after hours of sweat I am still way off the mark. (This is self-study, not homework.)

I am following the convention in many user-written functions that write a function like loglik <- function(theta, y, x) where theta is a vector of weights beta and variance sigma, y is the outcome and x is the data.

My full code with simulated data is below. Running it, you will see that my function is way off the mark compared to lm(). Can anyone give me an idea as to where I am going wrong?

# random data
set.seed(111)
y = sample(1:100,100)
x1 = sample(1:100,100)*rnorm(1,0)
x2 = sample(x1)*rnorm(1,0)
x3 = sample(x2)*rnorm(1,0)
dat = data.frame(x1,x2,x3)

# define gaussian log-likelihood
logLik <- function(theta, Y, X){
  X           <- as.matrix(X) # convert data to matrix
  k           <- ncol(X) # get the number of columns (independent vars)
  beta        <- theta[1:k] # vector of weights intialized with starting values
  expected_y  <- X %*% beta  # X is dimension (n x k) and beta is dimension (k x 1)
  sigma2      <- theta[k+1] # should be pulled from the last of the starting values vector
  LL          <- sum(dnorm(Y, mean = expected_y, sd = sigma2, log = T)) # sum of the PDF over each observation
  return(-LL)
}

Here is the output:

> optim(logLik, par=starting_values, method="Nelder-Mead", Y=y, X=dat, hessian = T)$par
[1]   0.4832514  -0.2276684  -0.3860800  32.7168490 -38.9030319
> coefficients(lm(y~x1+x2+x3))
(Intercept)          x1          x2          x3 
58.17347451 -0.06587320  0.13001865 -0.03624233

John Fox · Accepted Answer

The basics of your approach are sound but some of the details are wrong. First, it makes sense to construct the data according to a Gaussian linear model; for example

set.seed(111)
X <- cbind(1, matrix(rnorm(100*3), 100, 3))
y <- X %*% rep(1, 4) + rnorm(100, 0, 2)

starting.values <- c(1, 1, 1, 1, 2) # actual parameters

# define gaussian log-likelihood
logLik <- function(theta, y, X){
  k           <- ncol(X) # get the number of columns (independent vars)
  beta        <- theta[1:k] # vector of weights intialized with starting values
  expected_y  <- X %*% beta  # X is dimension (n x k) and beta is dimension (k x 1)
  sigma      <- theta[k+1] # should be pulled from the last of the starting values vector
  LL          <- sum(dnorm(y, mean = expected_y, sd = sigma, log = TRUE)) # sum of the PDF over each observation
  return(-LL)
}

By the way, the *norm() functions are parametrized in terms of the SD, not the variance.

Then

> optim(logLik, par=starting.values, y=y, X=X, method="BFGS")$par
[1] 1.0471420 1.1411523 0.8167656 0.9840397 1.8910201
Warning message:
In dnorm(y, mean = expected_y, sd = sigma, log = TRUE) : NaNs produced

> summary(lm(y ~ X - 1))

Call:
lm(formula = y ~ X - 1)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.5062 -1.3293  0.1371  1.2057  5.8116 

Coefficients:
   Estimate Std. Error t value Pr(>|t|)    
X1   1.0471     0.1952   5.365 5.61e-07 ***
X2   1.1412     0.1818   6.275 1.00e-08 ***
X3   0.8168     0.1907   4.282 4.39e-05 ***
X4   0.9840     0.2122   4.638 1.11e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.93 on 96 degrees of freedom
Multiple R-squared:  0.5333,    Adjusted R-squared:  0.5138 
F-statistic: 27.42 on 4 and 96 DF,  p-value: 3.468e-15

Notice that method="BFGS" issues a warning but produces the right answer; method="Nelder-Mead" is slightly less accurate. Also notice that the usual estimate of the SD of the errors differs from the ML estimate.

I hope this helps

Coding a Gaussian log-likelihood in R

Answers (1)

Related Questions