szd116

Reputation: 172

Calculating the AIC manually, given some data and the name of a distribution

Suppose I have the following data:

 array([[0.88574245, 0.3749999 , 0.39727183, 0.50534724],
        [0.22034441, 0.81442653, 0.19313024, 0.47479565],
        [0.46585887, 0.68170517, 0.85030437, 0.34167736],
        [0.18960739, 0.25711086, 0.71884116, 0.38754042]])

and I know that this data follows a normal distribution. How do I calculate the AIC? The formula is

2K - 2log(L)

K is the number of parameters; for a normal distribution I believe there are 3 (mean, variance, and residual). I'm stuck on L. L is supposed to be the maximized likelihood function, but I'm not sure what to pass in for data that follows a normal distribution. What about Cauchy or exponential? Thank you.

Update: this question appeared in one of my coding interviews.

Upvotes: 2

Views: 3669

Answers (2)

StupidWolf

Reputation: 46908

For a given normal distribution, the probability density at y is given by:

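import scipy.stats

# Density of a normal distribution with the given mean and sd, evaluated at y.
# y can be a single number or an array-like of values.
def prob( y = 0, mean = 0, sd = 1 ):
    return scipy.stats.norm( mean, sd ).pdf( y )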

For example, given mean = 0 and sd = 1, the density at the value 0 is prob( 0, 0, 1 ).

If we have a set of values 0 - 9, the log likelihood is the sum of the logs of these densities. In this case the best (maximum likelihood) parameters are the mean of x and the standard deviation of x, as in:

import numpy as np
x = np.arange( 10 )  # the values 0 to 9
# np.std divides by n by default, which is the maximum likelihood estimate
logLik = sum( np.log( prob( x, np.mean( x ), np.std( x ) ) ) )

Then the AIC is simply:

K = 2  # two free parameters: mean and standard deviation
aic = 2*K - 2*( logLik )

For the data you provide, I am not sure what the four rows and four columns represent. Do you have to calculate four means and four standard deviations? It's not very clear.
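One possible reading, sketched here purely as an assumption, is that the 4x4 array is just 16 one-dimensional observations. Under that assumption the same steps apply after flattening. The function name `normal_aic` is made up for illustration.

```python
import numpy as np
import scipy.stats

# Data from the question, flattened into 16 one-dimensional observations.
# Flattening is an assumption: the question does not say how to read the array.
data = np.array([
    [0.88574245, 0.3749999, 0.39727183, 0.50534724],
    [0.22034441, 0.81442653, 0.19313024, 0.47479565],
    [0.46585887, 0.68170517, 0.85030437, 0.34167736],
    [0.18960739, 0.25711086, 0.71884116, 0.38754042],
]).ravel()

def normal_aic(x):
    """AIC of a normal model fitted to x by maximum likelihood."""
    x = np.asarray(x, dtype=float)
    mu = x.mean()
    sigma = x.std()  # ddof=0, the maximum likelihood estimate
    log_lik = scipy.stats.norm(mu, sigma).logpdf(x).sum()
    k = 2  # free parameters: mean and standard deviation
    return 2 * k - 2 * log_lik

aic = normal_aic(data)
print(aic)
```

Using `logpdf` directly avoids taking the log of very small densities, which matters more for larger data sets than for this one.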

Hopefully the above can get you started.

Upvotes: 4

Robert Dodier

Reputation: 17576

I think the interview question leaves out some stuff, but maybe part of the point is to see how you handle that.

Anyway, AIC is essentially a penalized log likelihood calculation. Log likelihood is great -- the greater the log likelihood, the better the model fits the data. However, if you have enough free parameters, you can always make the log likelihood greater. Hmm. So various penalty terms, which counter the effect of more free parameters, have been proposed. AIC (Akaike Information Criterion) is one of them.

So the problem, as it is stated, is (1) find the log likelihood for each of the three models given (normal, exponential, and Cauchy), (2) count up the free parameters for each, and (3) calculate AIC from (1) and (2).

Now for (1) you need (1a) to look up or derive the maximum likelihood estimator for each model. For normal, it's just the sample mean and sample variance (the maximum likelihood version of the variance divides by n, not n - 1). I don't remember the others, but you can look them up, or work them out. Then (1b) you need to apply the estimators to the given data, and then (1c) calculate the likelihood, or equivalently, the log likelihood of the estimated parameters for the given data. The log likelihood of any parameter value is just sum(log(p(x|params))) where params = parameters as estimated by maximum likelihood.
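Steps (1a) to (1c) could be sketched roughly as follows. This is a sketch under assumptions: the data are one-dimensional and non-negative, the exponential is taken with its location fixed at 0 (so its maximum likelihood scale is the sample mean), and the Cauchy, which has no closed-form estimator, is fitted numerically with `scipy.stats.cauchy.fit`. The function name and the sample values are made up for illustration.

```python
import numpy as np
import scipy.stats

def max_log_likelihoods(x):
    """Maximized log likelihood of x under three candidate models."""
    x = np.asarray(x, dtype=float).ravel()

    # Normal: the MLEs are the sample mean and the divide-by-n std.
    mu, sigma = x.mean(), x.std()
    ll_normal = scipy.stats.norm.logpdf(x, loc=mu, scale=sigma).sum()

    # Exponential with location fixed at 0: the MLE of the scale is the
    # sample mean (equivalently, rate = 1 / mean).
    scale = x.mean()
    ll_expon = scipy.stats.expon.logpdf(x, loc=0, scale=scale).sum()

    # Cauchy: no closed-form MLE, so use a numerical fit.
    loc_c, scale_c = scipy.stats.cauchy.fit(x)
    ll_cauchy = scipy.stats.cauchy.logpdf(x, loc=loc_c, scale=scale_c).sum()

    return {"normal": ll_normal, "exponential": ll_expon, "cauchy": ll_cauchy}

# Illustrative values only, not the data from the question.
sample = [0.2, 0.4, 0.5, 0.7, 0.9]
print(max_log_likelihoods(sample))
```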

As for (2), there are 2 parameters for a normal distribution, mu and sigma^2. For an exponential, there's 1 (it might be called lambda or theta or something). For a Cauchy, there might be a scale parameter and a location parameter. Or, maybe there are no free parameters (centered at zero and scale = 1). So in each case, K = 1 or 2 or maybe K = 0, 1, or 2.
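To show how the parameter count enters, here is a rough sketch that pairs each model with its K and computes the AIC. It assumes one-dimensional, non-negative data, and it includes both Cauchy variants mentioned above: fitted location and scale (K = 2) and the standard Cauchy with nothing fitted (K = 0). The names `aic` and `compare_aic` and the sample values are made up for illustration.

```python
import numpy as np
import scipy.stats

def aic(log_lik, k):
    """AIC = 2K - 2 log(L), with L the maximized likelihood."""
    return 2 * k - 2 * log_lik

def compare_aic(x):
    x = np.asarray(x, dtype=float).ravel()
    # (distribution, parameters held fixed during the fit, free parameter count)
    candidates = {
        "normal (K=2)": (scipy.stats.norm, {}, 2),
        "exponential (K=1)": (scipy.stats.expon, {"floc": 0}, 1),
        "cauchy, fitted (K=2)": (scipy.stats.cauchy, {}, 2),
    }
    results = {}
    for name, (dist, fixed, k) in candidates.items():
        params = dist.fit(x, **fixed)  # maximum likelihood (loc, scale)
        results[name] = aic(dist.logpdf(x, *params).sum(), k)
    # Standard Cauchy: location 0 and scale 1, nothing estimated.
    results["cauchy, standard (K=0)"] = aic(scipy.stats.cauchy.logpdf(x).sum(), 0)
    return results

# Illustrative values only, not the data from the question.
sample = [0.2, 0.4, 0.5, 0.7, 0.9]
for name, value in compare_aic(sample).items():
    print(name, value)
```

The model with the lowest AIC is the preferred one among the candidates.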

Going back to (1b), the data look a little funny to me. I would expect a one dimensional list, but it seems like the array is two dimensional (with 4 rows and 4 columns if I counted right). One might need to go back and ask about that. If they really mean to have 4 dimensional data, then the conceptual basis remains the same, but the calculations are going to be a little more complex than in the 1-d case.
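If the array really is meant as 4 observations of 4-dimensional data (an assumption, with rows taken as observations), one simple version is sketched below. It treats the 4 columns as independent normals, giving K = 8. A full covariance matrix is not used here because, with only 4 observations in 4 dimensions, the maximum likelihood covariance estimate is singular. The function name is made up for illustration.

```python
import numpy as np
import scipy.stats

# Data from the question; rows are assumed to be observations.
rows = np.array([
    [0.88574245, 0.3749999, 0.39727183, 0.50534724],
    [0.22034441, 0.81442653, 0.19313024, 0.47479565],
    [0.46585887, 0.68170517, 0.85030437, 0.34167736],
    [0.18960739, 0.25711086, 0.71884116, 0.38754042],
])

def independent_normal_aic(obs):
    """AIC for a model with one independent normal per column."""
    obs = np.asarray(obs, dtype=float)
    n_dim = obs.shape[1]
    mu = obs.mean(axis=0)     # one mean per column
    sigma = obs.std(axis=0)   # one divide-by-n std per column
    # loc and scale broadcast across the rows of obs
    log_lik = scipy.stats.norm.logpdf(obs, loc=mu, scale=sigma).sum()
    k = 2 * n_dim             # a mean and a std for each column
    return 2 * k - 2 * log_lik

print(independent_normal_aic(rows))
```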

Good luck and have fun, it's a good problem.

Upvotes: 4
