PhD Rookie

Reputation: 101

Big O Notation in PAC Learning

This may be a very basic question, but I'm currently going through Andrew Ng's CS229 notes on Learning Theory (specifically PAC Learning). What I see is that the error on a given hypothesis is less than or equal to the error on the best possible hypothesis + an expression inside Big O Notation:

E(h-hat) <= E(h*) + O(sqrt((d/m) log(m/d) + (1/m) log(1/delta)))

From my understanding, Big O notation has to do with the convergence of some function. How do I interpret this Big O notation? As someone without a heavy math background, I don't know whether I should let all the variables d, m, and delta approach infinity, or just plug in values and ignore the O.

Upvotes: 3

Views: 218

Answers (2)

Dima Chubarov

Reputation: 17179

I'd like to add to @Engineero's post. Here is a generic interpretation of the Big-O notation.

The inequality a < b + O(f(d, m, delta)) can be interpreted as

There exists a number K > 0, independent of d, m, and delta, such that for any values of d, m, and delta

a < b + K * f(d,m,delta)

For those who are familiar with the quantifier notation used in mathematical logic and elsewhere, this is exactly

 ∃ K > 0 ∀ d, m, delta ( a < b + K * f(d, m, delta) )

Here a stands for e(h^), b stands for e(h*), and f(d, m, delta) for sqrt((d/m) log(m/d) + (1/m) log(1/delta)).
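
To make that reading concrete, here is a minimal Python sketch (the values of d, m, and delta are purely illustrative, and the constant K hidden by the O is deliberately left unspecified):

    import math

    def f(d, m, delta):
        # The function inside the O(...): sqrt((d/m) log(m/d) + (1/m) log(1/delta))
        return math.sqrt((d / m) * math.log(m / d) + (1 / m) * math.log(1 / delta))

    # The Big-O statement only asserts that SOME constant K > 0 exists such that
    # e(h^) <= e(h*) + K * f(d, m, delta) for all d, m, delta.  K itself is unknown,
    # so the value of f is mainly informative about how the gap shrinks as m grows.
    d, delta = 10, 0.05   # hypothetical values for illustration
    for m in (100, 1000, 10000):
        print(f"m = {m:6d}  f(d, m, delta) = {f(d, m, delta):.4f}")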

Upvotes: 2

Engineero

Reputation: 12938

You need a bit more information here to answer the question, but looking at the notes we have:

  • h^: the hypothesis selected (learned) from the domain H
  • h*: the best possible hypothesis in the domain H
  • d: the number of parameters used to define h^ and h*
  • m: the number of samples used to learn h^
  • delta: the probability with which the bound is allowed to fail, i.e., the inequality holds with probability at least 1 - delta

Basically, what the equation says is that with probability 1 - delta, you can guarantee that the prediction error of the hypothesis learned from a given domain of hypotheses is bounded by the prediction error of the best hypothesis in that domain plus a term that shrinks as m grows.

The interesting thing about this is that it allows you to plan your data collection around the kind of generalization-error guarantee you want to achieve. So if you wanted to know the bound on your error with 99% probability for an algorithm parameterized by 10 parameters, as a function of how many data samples you have, you would set delta = 0.01 and d = 10, and then calculate the part inside the O(...) as you increase m up to however many data samples you think is reasonable. Plotting that term as you vary m is one way of determining how many data samples are reasonable, and planning your data collection accordingly.
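
For instance, here is a minimal Python sketch of that planning exercise (d = 10 and delta = 0.01 are just the illustrative values above; numpy and matplotlib are assumed, and the constant hidden by the O(...) is ignored, so only the shape of the curve is meaningful):

    import numpy as np
    import matplotlib.pyplot as plt

    d = 10          # number of parameters
    delta = 0.01    # we want the guarantee to hold with probability 1 - delta = 99%

    # Sample sizes to consider; start above d so that log(m/d) is positive.
    m = np.arange(d + 1, 5001)

    # The term inside O(...): sqrt((d/m) log(m/d) + (1/m) log(1/delta)).
    # The constant K hidden by the O is ignored, so only how this curve
    # decays with m is meaningful, not its absolute value.
    bound_term = np.sqrt((d / m) * np.log(m / d) + (1 / m) * np.log(1 / delta))

    plt.plot(m, bound_term)
    plt.xlabel("m (number of training samples)")
    plt.ylabel("term inside O(...)")
    plt.show()

Reading off where the curve flattens out gives a rough sense of when collecting more data stops buying you much in terms of the guarantee.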

Upvotes: 3
