Alex
Alex

Reputation: 165

Generating a column of probabilities in r for a dataframe

I have a dataframe dfof football matches and corresponding mean scoring rates for each team. The first 2 columns are Home and Away teams, the 3rd and 4th columns are mean scoring rates for Home and Away teams respectively. For example:

       Home      Away HomeRate AwayRate
1   Arsenal   Bristol      1.3      1.3
2   Chelsea Newcastle      2.4      2.3
3 Liverpool     Leeds      3.3      1.7
4    Bolton    Fulham      2.1      2.5

I want to generate 3 new columns, "HomeWin", "Draw", "AwayWin" giving Poisson probabilities of home win, draw, away win respectively for each match and add these columns to my dataframe. I assume events are independent.

I can do this manually for each individual match separately. But I don't know how to do this all at once for the whole table. Take for instance 2nd row (Chelsea vs Newcastle). I can create a probability matrix consisting of probabilities for each respective scoreline between Chelsea and Newcastle (0-0, 0-1 0-2, etc):

H <- c(dpois(0:20, df[2,3]), ppois(20, df[2,3], lower.tail=FALSE))
A <- c(dpois(0:20, df[2,4]), ppois(20, df[2,3], lower.tail=FALSE))
ProbMatrix = outer(H, A)

In the above block, df[2,3] is Chelsea mean score rate, df[2,4] is Newcastle mean score rate and I am assuming max number of possible goals scored for each side is 20 (hence I am using dpois(0:20) and ppois(20) in the above, to get the total sum of probabilities as close to 1 as possible.

Then I can simply sum over various elements of the matrix above to generate Home, Away and Draw win probabilities:

Prob_Home_Win = sum(ProbMatrix[lower.tri(ProbMatrix)])
Prob_Away_Win = sum(ProbMatrix[upper.tri(ProbMatrix)])
Prob_Draw = sum(diag(ProbMatrix))

But how can I generate efficiently the whole HomeWin, AwayWin and Draw columns yielding the probabilities for each match in df ?

Upvotes: 2

Views: 95

Answers (2)

AkselA
AkselA

Reputation: 8846

The distribution of the difference between two independent Poisson samples is called a Skellam distribution, and there is an R implementation of this (skellam)

We can use this to simplify the code greatly:

library(skellam)

hda <- function(home, away, q=-20:20) {
    ds <- dskellam(q, home, away)
    c(HomeWin=sum(ds[q>0]), Draw=ds[q==0], AwayWin=sum(ds[q<0]))
}

# Sanity test
hda(1, 1)
#    HomeWin       Draw    AwayWin 
# 0.34574584 0.30850832 0.34574584 
hda(1, 3)
#     HomeWin        Draw     AwayWin 
# 0.093863113 0.131121595 0.775015291 

rates <- read.table(text="
       Home      Away HomeRate AwayRate
1   Arsenal   Bristol      1.3      1.3
2   Chelsea Newcastle      2.4      2.3
3 Liverpool     Leeds      3.3      1.7
4    Bolton    Fulham      2.1      2.5")

t(mapply(hda, rates[,3], rates[,4]))
#         HomeWin       Draw    AwayWin
# [1,] 0.36804300 0.26391400 0.36804300
# [2,] 0.42311686 0.18953470 0.38734844
# [3,] 0.68709516 0.14524034 0.16766450
# [4,] 0.33355639 0.18898591 0.47745770

Upvotes: 7

Axeman
Axeman

Reputation: 35387

apply a function to your data.frame, e.g. like so:

win_rate <- function(goal_rates) {
  H <- c(dpois(0:20, goal_rates[1]), ppois(20, goal_rates[1], lower.tail=FALSE)
  # !! note goal_rates[1] for `ppois`, not sure if you had a typo...
  A <- c(dpois(0:20, goal_rates[2]), ppois(20, goal_rates[2], lower.tail=FALSE)) 

  ProbMatrix = outer(H, A)
  Prob_Home_Win = sum(ProbMatrix[lower.tri(ProbMatrix)])
  Prob_Away_Win = sum(ProbMatrix[upper.tri(ProbMatrix)])
  Prob_Draw = sum(diag(ProbMatrix))

  return(c(Prob_Home_Win, Prob_Away_Win, Prob_Draw))
}
results <- apply(df[3:4], 1, win_rate) |> 
   t() |> as.data.frame() |> setNames(c('HomeWin', 'HomeLoss', 'Draw'))

cbind(df, results)
       Home      Away HomeRate AwayRate   HomeWin  HomeLoss      Draw
1   Arsenal   Bristol      1.3      1.3 0.3680430 0.3680430 0.2639140
2   Chelsea Newcastle      2.4      2.3 0.4231169 0.3873484 0.1895347
3 Liverpool     Leeds      3.3      1.7 0.6870952 0.1676645 0.1452403
4    Bolton    Fulham      2.1      2.5 0.3335564 0.4774577 0.1889859

Upvotes: 1

Related Questions