Reputation: 33
I have soccer results data in the following format (thousands of observations):
Div date value pts
1 E0 2011-08-13 Blackburn 0.0
2 E0 2011-08-13 Fulham 0.5
3 E0 2011-08-13 Liverpool 0.5
4 E0 2011-08-13 Newcastle 0.5
5 E0 2011-08-13 QPR 0.0
6 E0 2011-08-13 Wigan 0.5
7 E0 2011-08-14 Stoke 0.5
8 E0 2011-08-14 West Brom 0.0
9 E0 2011-08-15 Man City 1.0
10 E0 2011-08-20 Arsenal 0.0
11 E0 2011-08-20 Aston Villa 1.0
plus other variables. "value" is the team, pts is the final result (win/loss/draw) as a numerical value. I'm trying to add a new variable which is the average of this value over the last X games for the team in that row. How do I do this without using some horrible loop?
Upvotes: 3
Views: 798
Reputation: 27398
This can be done quite efficiently with tapply
. I've altered your data somewhat by duplicating teams' games, with random scores and dates. This takes the mean of the most recent 2 games, as specified in the tail
function.
# create some data
d <- structure(list(Div = structure(rep(1L, 33), .Label = " E0",
class = "factor"), date = structure(c(15013, 14990, 14996, 15001, 14995, 15006,
15020, 15032, 15023, 15022, 15015, 15016, 15034, 14994, 14986, 14998, 14982,
14979, 14980, 15016, 15031, 15013, 15031, 14999, 15025, 14978, 15007, 15026,
14992, 14997, 15023, 14986, 15028), class = "Date"),
value = structure(c(3L, 4L, 5L, 7L, 8L, 11L, 9L, 10L, 6L, 1L, 2L, 3L, 4L, 5L,
7L, 8L, 11L, 9L, 10L, 6L, 1L, 2L, 3L, 4L, 5L, 7L, 8L, 11L, 9L, 10L, 6L, 1L,
2L), .Label = c("Arsenal", "Aston Villa", "Blackburn", "Fulham", "Liverpool",
"Man City", "Newcastle", "QPR", "Stoke", "West Brom", "Wigan"),
class = "factor"), pts = c(0.5, 0.5, 0.5, 1, 1, 1, 1, 0, 1, 0.5, 0, 1, 1, 1, 1,
0.5, 0.5, 0, 0.5, 0.5, 0, 0, 0, 1, 0, 0, 0.5, 0, 1, 0, 0.5, 0.5, 0.5)),
.Names = c("Div", "date", "value", "pts"), row.names = c(NA, 33L),
class = "data.frame")
# sort rows by date
d2 <- d[order(d$date),]
# mean of all games
tapply(d2$pts, d2$value, mean)
# mean of last 2 games
tapply(d2$pts, d2$value, function(x) mean(tail(x, 2)))
# To tidy up the output, you could use simplify=FALSE and do.call(rbind, x):
# e.g., mean of last 2 games:
do.call(rbind, tapply(d2$pts, d2$value, function(x) mean(tail(x, 2)),
simplify=F))
[,1]
Arsenal 0.25
Aston Villa 0.25
Blackburn 0.50
Fulham 1.00
Liverpool 0.25
Man City 0.75
Newcastle 1.00
QPR 0.50
Stoke 1.00
West Brom 0.00
Wigan 0.50
Upvotes: 1
Reputation: 23918
Try ave
function from stats.
Trt <- gl(n=2, k=3, length=2*3, labels =c("A", "B"))
Y <- 1:6
Data <- data.frame(Trt, Y)
Data
Trt Y
1 A 1
2 A 2
3 A 3
4 B 4
5 B 5
6 B 6
Data$TrtMean <- ave(Y, Trt, FUN=mean)
Data
Trt Y TrtMean
1 A 1 2
2 A 2 2
3 A 3 2
4 B 4 5
5 B 5 5
6 B 6 5
Upvotes: 1
Reputation: 43265
take a look at this
using the zoo package and rollmean
and the plyr package's ddply
:
library(zoo)
library(plyr)
dat <- data.frame(value=letters[1:5], pts=sample(c(0, 0.5, 1), 50, replace=T))
ddply(dat, .(value), summarise, rollmean(pts, k=5, align='right'))
however, as far as I understand a "rolling average" it shortens your data set by definition. you can supply a fill argument though:
ddply(dat, .(value), summarise, rollmean(pts, k=5, fill=NA, align='right'))
Upvotes: 3