Reputation: 35792
So I have a bunch of hyperlinks on a web page. From past observation I know the probabilities that a user will click on each of these hyperlinks. I can therefore calculate the mean and standard deviation of these probabilities.
I now add a new hyperlink to this page. After a short amount of testing I find that of the 20 users that see this hyperlink, 5 click on it.
Taking into account the known mean and standard deviation of the click-through probabilities on other hyperlinks (this forms a "prior expectation"), how can I efficiently estimate the probability of a user clicking on the new hyperlink?
A naive solution would be to ignore the other probabilities, in which case my estimate is just 5/20 = 0.25; however, this means we are throwing away relevant information, namely our prior expectation of what the click-through probability is.
So I'm looking for a function that looks something like this:
double estimate(double priorMean,
                double priorStandardDeviation,
                int clicks, int views);
Since I'm more familiar with code than mathematical notation, I'd ask that any answers use code or pseudocode in preference to math.
Upvotes: 0
Views: 1909
Reputation: 6904
I made this a new answer since it's fundamentally different.
This is based on Chris Bishop, Pattern Recognition and Machine Learning, Chapter 2 "Probability Distributions", p71++, and http://en.wikipedia.org/wiki/Beta_distribution.
First we fit a beta distribution to the given mean and variance in order to build a prior distribution over the parameter. Then we return the mode of the posterior distribution, which is the most probable parameter of a Bernoulli variable.
def estimate(prior_mean, prior_variance, clicks, views):
    # Fit a Beta(a, b) prior by the method of moments:
    # a + b = mean * (1 - mean) / variance - 1, a = mean * (a + b).
    c = prior_mean * (1 - prior_mean) / prior_variance - 1
    a = prior_mean * c
    b = (1 - prior_mean) * c
    # The posterior is Beta(a + clicks, b + views - clicks); return its
    # mode, (alpha - 1) / (alpha + beta - 2), i.e. the MAP estimate.
    return (a + clicks - 1) / (a + b + views - 2)
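For example, with a made-up prior mean of 0.3 and prior variance 0.01, the question's 5 clicks out of 20 views give:

estimate(0.3, 0.01, 5, 20)   # c = 20, a = 6, b = 14, so (6 + 5 - 1) / (6 + 14 + 20 - 2) ≈ 0.263

The raw 5/20 = 0.25 gets pulled toward the prior mean of 0.3, as you would expect.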
However, I am quite positive that the prior mean/variance will not work for you, since they throw away information about how many samples your prior is based on, and thus how strong that prior is.
Instead: given a set of (webpage, link_clicked) pairs, you can count the number of times a specific link was clicked. Let that be m. Let the number of times that link was shown but not clicked be l.
Now let a be the number of clicks on your new link and b the number of times it was shown but not clicked. Then the probability of a click on your new link is
def estimate(m, l, a, b):
    # Posterior mean of a Beta(m, l) prior updated with a clicks and b non-clicks.
    return (m + a) / (m + l + a + b)
This looks pretty trivial, but it actually has a valid probabilistic foundation. From the implementation perspective, you can keep m and l as global counts.
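As a made-up example: say the existing links were clicked m = 30 times and not clicked l = 70 times in total, and the new link has a = 5 clicks and b = 15 non-clicks:

estimate(30, 70, 5, 15)   # (30 + 5) / (30 + 70 + 5 + 15) = 35 / 120 ≈ 0.292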
Upvotes: 3
Reputation: 6904
P/N is actually correct from a frequentist perspective.
You could also use a Bayesian approach to incorporate prior knowledge, but since you don't seem to have that knowledge, I guess P/N is the way to go.
If you want, you can also use Laplace's rule, which (iirc) comes down to a uniform prior: just give each link on the page a starting count of 1 instead of 0. (So if you count the number of times a link was clicked, give each a +1 bonus and reflect that in your N.) A minimal sketch of this follows below.
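Here is that sketch, written as the classic rule of succession, which amounts to a uniform Beta(1, 1) prior; the function name is just for illustration:

def laplace_estimate(clicks, views):
    # One pseudo-click and one pseudo-non-click from the uniform prior.
    return (clicks + 1) / (views + 2)

laplace_estimate(5, 20)   # the question's data: 6 / 22 ≈ 0.273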
[UPDATE] Here is a Bayesian approach:
Let p(W) be the probability that a person is in a specific group W. Let p(L) be the probability that a specific link is clicked. Then the probability you are looking for is p(L|W). By Bayes' theorem, you can calculate this as
p(L|W) = p(W|L) * p(L) / p(W)
You can estimate p(L) by the fraction of views in which L was clicked, p(W) by the size of that group relative to all users, and p(W|L) = p(W and L) / p(L) by the number of persons in group W that clicked L divided by the total number of persons that clicked L.
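A minimal sketch of that estimate from raw counts (all parameter names here are made up for illustration):

def estimate_p_link_given_group(clicks_from_group, total_clicks,
                                group_size, total_users, total_views):
    p_l = total_clicks / total_views                 # p(L)
    p_w = group_size / total_users                   # p(W)
    p_w_given_l = clicks_from_group / total_clicks   # p(W|L)
    # Bayes' theorem: p(L|W) = p(W|L) * p(L) / p(W)
    return p_w_given_l * p_l / p_w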
Upvotes: 2
Reputation: 26427
You need to know how strongly X is correlated with W.
Most likely you also want a more complex mathematical model if you want to develop a big website. If you run a website like Digg, you have a lot of prior knowledge that you have to factor into your calculation. That leads to multivariate statistics.
Upvotes: 0
Reputation: 47944
Bayes' Theorem Proof:
P(A,B) = P( A | B ) * P( B ) (1)
since,
P(A,B) = P(B,A) (2)
and since, by the same product rule, P(B,A) = P(B | A) * P(A), combining (1) and (2) gives
P(A | B) * P(B) = P(B | A) * P(A)
thus (Bayes' Theorem),
P(A | B) = P(B | A) * P(A) / P(B)
P(A) -- prior/marginal probability of A, may or may not take into account B
P(A|B) -- conditional/posterior probability of A, given B.
P(B|A) -- conditional probability of B given A.
P(B) -- prior/marginal probability of B
Consequences: if P(A | B) = P(A), then A and B are independent, in which case P(B | A) = P(B) as well, and the definition of independence is
P(A,B) = P(A | B) * P(B) = P(A) * P(B)
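As a quick numeric sanity check of the theorem with made-up counts (out of 100 page views, 30 involve event A, 20 involve event B, and 10 involve both):

p_a, p_b, p_a_and_b = 30 / 100, 20 / 100, 10 / 100
p_b_given_a = p_a_and_b / p_a                       # 1/3
p_a_given_b = p_b_given_a * p_a / p_b               # Bayes: (1/3) * 0.3 / 0.2 = 0.5
assert abs(p_a_given_b - p_a_and_b / p_b) < 1e-9    # matches P(A,B) / P(B) directly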
It should be noted that it is easy to manipulate the probability to your liking by changing the priors and the way the problem is framed; take a look at this discussion of the Anthropic Principle and Bayes' Theorem.
Upvotes: 0