user1185563
user1185563

Reputation: 33

R: Calculating average by team

I have soccer results data in the following format (thousands of observations):

     Div  date       value     pts
1    E0 2011-08-13   Blackburn 0.0
2    E0 2011-08-13      Fulham 0.5
3    E0 2011-08-13   Liverpool 0.5
4    E0 2011-08-13   Newcastle 0.5
5    E0 2011-08-13         QPR 0.0
6    E0 2011-08-13       Wigan 0.5
7    E0 2011-08-14       Stoke 0.5
8    E0 2011-08-14   West Brom 0.0
9    E0 2011-08-15    Man City 1.0
10   E0 2011-08-20     Arsenal 0.0
11   E0 2011-08-20 Aston Villa 1.0

plus other variables. "value" is the team, pts is the final result (win/loss/draw) as a numerical value. I'm trying to add a new variable which is the average of this value over the last X games for the team in that row. How do I do this without using some horrible loop?

Upvotes: 3

Views: 798

Answers (3)

jbaums
jbaums

Reputation: 27398

This can be done quite efficiently with tapply. I've altered your data somewhat by duplicating teams' games, with random scores and dates. This takes the mean of the most recent 2 games, as specified in the tail function.

# create some data
d <- structure(list(Div = structure(rep(1L, 33), .Label = " E0", 
  class = "factor"), date = structure(c(15013, 14990, 14996, 15001, 14995, 15006, 
  15020, 15032, 15023, 15022, 15015, 15016, 15034, 14994, 14986, 14998, 14982, 
  14979, 14980, 15016, 15031, 15013, 15031, 14999, 15025, 14978, 15007, 15026, 
  14992, 14997, 15023, 14986, 15028), class = "Date"), 
  value = structure(c(3L, 4L, 5L, 7L, 8L, 11L, 9L, 10L, 6L, 1L, 2L, 3L, 4L, 5L, 
  7L, 8L, 11L, 9L, 10L, 6L, 1L, 2L, 3L, 4L, 5L, 7L, 8L, 11L, 9L, 10L, 6L, 1L, 
  2L), .Label = c("Arsenal", "Aston Villa", "Blackburn", "Fulham", "Liverpool",
  "Man City", "Newcastle", "QPR", "Stoke", "West Brom", "Wigan"), 
  class = "factor"), pts = c(0.5, 0.5, 0.5, 1, 1, 1, 1, 0, 1, 0.5, 0, 1, 1, 1, 1, 
  0.5, 0.5, 0, 0.5, 0.5, 0, 0, 0, 1, 0, 0, 0.5, 0, 1, 0, 0.5, 0.5, 0.5)), 
  .Names = c("Div", "date", "value", "pts"), row.names = c(NA, 33L), 
  class = "data.frame")

# sort rows by date
d2 <- d[order(d$date),]
# mean of all games
tapply(d2$pts, d2$value, mean)
# mean of last 2 games
tapply(d2$pts, d2$value, function(x) mean(tail(x, 2)))

# To tidy up the output, you could use simplify=FALSE and do.call(rbind, x):
# e.g., mean of last 2 games:
do.call(rbind, tapply(d2$pts, d2$value, function(x) mean(tail(x, 2)), 
  simplify=F))

            [,1]
Arsenal     0.25
Aston Villa 0.25
Blackburn   0.50
Fulham      1.00
Liverpool   0.25
Man City    0.75
Newcastle   1.00
QPR         0.50
Stoke       1.00
West Brom   0.00
Wigan       0.50

Upvotes: 1

MYaseen208
MYaseen208

Reputation: 23918

Try ave function from stats.

Trt <- gl(n=2, k=3, length=2*3, labels =c("A", "B"))
Y <- 1:6
Data <- data.frame(Trt, Y)
 Data
  Trt Y
1   A 1
2   A 2
3   A 3
4   B 4
5   B 5
6   B 6
Data$TrtMean <- ave(Y, Trt, FUN=mean)
Data
  Trt Y TrtMean
1   A 1       2
2   A 2       2
3   A 3       2
4   B 4       5
5   B 5       5
6   B 6       5

Upvotes: 1

Justin
Justin

Reputation: 43265

take a look at this

using the zoo package and rollmean and the plyr package's ddply:

library(zoo)
library(plyr)
dat <- data.frame(value=letters[1:5], pts=sample(c(0, 0.5, 1), 50, replace=T))
ddply(dat, .(value), summarise, rollmean(pts, k=5, align='right'))

however, as far as I understand a "rolling average" it shortens your data set by definition. you can supply a fill argument though:

ddply(dat, .(value), summarise, rollmean(pts, k=5, fill=NA, align='right'))

Upvotes: 3

Related Questions