user3070752

Reputation: 734

What does Maximum Likelihood Estimation exactly mean?

When we are training our model, we usually use MLE to estimate its parameters. I know this means that the most probable data under such a learned model is our training set, but I'm wondering whether that probability is exactly 1 or not.

Upvotes: 1

Views: 2734

Answers (3)

val

Reputation: 8689

The example I got in my stat classes was as follows:

A suspect is on the run! Nothing is known about them, except that they're approximately 1m80 tall. Should the police look for a man or a woman?

The idea here is that you have a parameter for your model (M/F), and probabilities given that parameter. There are tall men, tall women, short men and short women. However, in the absence of any other information, the probability of a man being 1m80 is larger than the probability of a woman being 1m80. Likelihood (as bogatron very well explained) is a formalisation of that, and maximum likelihood is the estimation method based on favouring parameters which are more likely to result in the actual observations.

But that's just a toy example, with a single binary variable... Let's expand it a bit: I threw two identical dice, and the sum of their values is 7. How many sides did my dice have? Well, we all know that the probability of two D6 summing to 7 is quite high. But they might as well have been D4, D20, D100, ... However, P(7 | 2D6) > P(7 | 2D20), and P(7 | 2D6) > P(7 | 2D100) ..., so you might estimate that my dice are 6-sided. That doesn't mean it's true, but it's a reasonable estimate, in the absence of any additional information.
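A quick way to see this numerically (a minimal sketch of the paragraph above; the candidate die sizes and the brute-force enumeration are mine, not part of the answer):

    # Compare the likelihood of "the two dice sum to 7" under a few
    # hypothesised die sizes; the size with the highest likelihood wins.
    from itertools import product

    def p_sum_is_7(sides):
        """P(sum of two fair `sides`-sided dice == 7)."""
        rolls = list(product(range(1, sides + 1), repeat=2))
        return sum(a + b == 7 for a, b in rolls) / len(rolls)

    for sides in (4, 6, 20, 100):
        print(f"P(7 | 2D{sides}) = {p_sum_is_7(sides):.4f}")
    # -> 0.1250, 0.1667, 0.0150, 0.0006: the likelihood peaks at 6 sides,
    #    so two D6 is the maximum-likelihood guess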

That's better, but we're not in machine-learning territory yet... Let's get there: if you want to fit your umptillion-layer neural network on some empirical data, you can consider all possible parameterisations, and how likely each of them is to return the empirical data. That's exploring an umptillion-dimensional space, each dimension having infinitely many possibilities, but you can map every single one of these points to a likelihood. It is then reasonable to fit your network using the parameters that make that likelihood largest: given that the empirical data did occur, it is reasonable to assume that it should be likely under your model.

That doesn't mean that your parameters are likely! Just that under these parameters, the observed value is likely. Statistical estimation is usually not a closed problem with a single solution (like solving an equation might be, where you would have a probability of 1), but one where we need to find the best solution according to some metric. Likelihood is such a metric, and is used widely because it has some interesting properties:

  • It makes intuitive sense
  • It's reasonably simple to compute, fit and optimise, for a large family of models
  • For normal variables (which tend to crop up everywhere) MLE gives the same results as other methods, such as least-squares estimation (a quick numerical check of this follows the list)
  • Its formulation in terms of conditional probabilities makes it easy to use/manipulate it in Bayesian frameworks
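
To make the least-squares bullet above concrete, here is a small numerical check (a sketch only: the observations, the single-parameter constant-mean model and the fixed noise scale are made up for illustration). For a model with i.i.d. Gaussian noise of fixed variance, the parameter that maximises the likelihood is the same one that minimises the sum of squared residuals:

    import numpy as np
    from scipy.stats import norm

    y = np.array([1.2, 0.7, 1.9, 1.1, 1.4])  # made-up observations
    mus = np.linspace(0.0, 3.0, 301)          # candidate values of the parameter (a mean)

    # Gaussian log-likelihood of the data for each candidate mean (sigma fixed at 1)
    log_lik = np.array([norm.logpdf(y, loc=m, scale=1.0).sum() for m in mus])
    # Sum of squared residuals for each candidate mean
    sq_err = np.array([((y - m) ** 2).sum() for m in mus])

    print(mus[log_lik.argmax()])  # 1.26 -- the maximum-likelihood estimate
    print(mus[sq_err.argmin()])   # 1.26 -- the least-squares estimate: same value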

Upvotes: 3

bogatron

Reputation: 19169

You almost have it right. The Likelihood of a model (theta) for the observed data (X) is the probability of observing X, given theta:

L(theta|X) = P(X|theta)

For Maximum Likelihood Estimation (MLE), you choose the value of theta that provides the greatest value of P(X|theta). This does not necessarily mean that the observed value of X is the most probable for the MLE estimate of theta. It just means that there is no other value of theta that would provide a higher probability for the observed value of X.

In other words, if T1 is the MLE estimate of theta, and if T2 is any other possible value of theta, then P(X|T1) >= P(X|T2). However, there still could be another possible value of the data (Y), different from the observed data (X), such that P(Y|T1) > P(X|T1).

The probability of X for the MLE estimate of theta is not necessarily 1 (and probably never is except for trivial cases). This is expected since X can take multiple values that have non-zero probabilities.
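
A small numerical sketch of those statements (the Bernoulli coin model, the grid search and the particular observed sequence are assumptions chosen for illustration, not part of the answer):

    import numpy as np

    X = np.array([1, 1, 1, 0, 1])         # observed data: 4 ones, 1 zero

    def likelihood(theta, data):
        """P(data | theta) for i.i.d. Bernoulli(theta) trials."""
        return np.prod(np.where(data == 1, theta, 1 - theta))

    thetas = np.linspace(0.01, 0.99, 99)  # grid of candidate theta values
    L = np.array([likelihood(t, X) for t in thetas])

    T1 = thetas[L.argmax()]               # MLE: no theta on the grid gives X a higher probability
    print(T1)                             # ~0.8
    print(likelihood(T1, X))              # ~0.082 -- far from 1

    Y = np.ones(5)                        # a different possible data set (all ones)
    print(likelihood(T1, Y))              # ~0.328 > P(X|T1): Y is more probable than X under T1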

Upvotes: 5

Arnab Bhadury

Reputation: 108

To build on what bogatron said with an example, the parameters learned from MLE are the ones that explain the data you see (and nothing else) the best. And no, the probability is not 1 (except in trivial cases).

As an example (one that has been used billions of times) of what MLE does:

If you have a simple coin-toss problem, and you observe 5 results of coin tosses (H, H, H, T, H) and you do MLE, you will end up giving p(coin_toss == H) a high probability (0.80) because you saw heads far more often than tails. There are good and bad things about MLE, obviously...

Pros: It is an optimization problem, so it is generally quite fast to solve (even if there isn't an analytical solution). Cons: It can overfit when there isn't a lot of data (like our coin-toss example).
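
As a rough sketch of both the example and the overfitting caveat (the helper below is hypothetical, only meant to show the closed-form count-based estimate):

    def mle_heads_prob(tosses):
        """Maximum-likelihood estimate of P(H) from a list of 'H'/'T' tosses."""
        return tosses.count('H') / len(tosses)

    print(mle_heads_prob(['H', 'H', 'H', 'T', 'H']))  # 0.8, as in the example above
    print(mle_heads_prob(['H', 'H', 'H']))            # 1.0 -- with this little data, MLE
                                                      # "overfits" and declares tails impossible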

Upvotes: 3
