Reputation: 445
This is a follow on question from Cumulative Mean with Grouping and Lag and Grouped moving average in r.
I'm looking to create a cumulative mean field with a lag of one that groups over multiple variables but is only calculating the average on certain criteria. So for the example below, S-AVG only gives the cumulative mean for S and vice versa for O-AVG and J-AVG. I'm sure this is possible using ave and cumsum, but am unsure how to do it.
Here is the desired output:
Player Goals **S-AVG** **O-AVG** **J-AVG**
S 5
S 2 5
S 7 3.5
O 3
O 9 3
O 6 6
O 3 3
S 7 4.66
O 1 5.25
S 7 5.25
S 3 5.6
Q 8 4.4
S 3 5.16
O 4 5
P 1 4.857
S 9 4.857
S 4 5.375
Z 6 4.375
S 3 5.22
O 8 4.55
S 3 5
O 4 4.9
O 1 4.81
S 9 4.81
S 4 5.16
O 6 4.5
J 6
Here is the data input for r
Player <- c('S','S','S','O','O','O','O','S','O','S','S','O','S','O','O','S','S','O','S','O','S','O','O','S','S','O','J')
Goals <- c(5,2,7,3,9,6,3,7,1,7,3,8,3,4,1,9,4,6,3,8,3,4,1,9,4,6,6)
data.frame(Player, Goals)
Any help is appreciated.
Upvotes: 1
Views: 422
Reputation: 269674
Suppose DF2
is the data frame computed in my answer to the prior post referenced in the question, i.e. the data frame having the AVG
column. It is also reproduced in the Note at the end of this answer.
If we only had one or a small fixed number of players we could do this by writing out an AVG.*
column for each player (shown for one player):
transform(DF2, AVG.S = ifelse(Player == "S", AVG, NA))
but a more general approach follows. Set levs
to the levels of the Player
factor or if you don't want all players then levs
should be set to a character vector of just the players you want. Then use sapply
to contruct a logical matrix and convert that to a matrix of 1s and NAs which is then scalar multiplied by AVG
.
The solution has a number of desirable features -- it does not overwrite its input (which would be error prone) and it avoids needless repeated qualification (both thanks to transform
), it uses the whole object approach rather than loops and subscripting, it leverages existing code avoiding duplication (by using the result of the prior solution that this question is a follow-on to) and is brief -- two lines of code. It uses no packages.
(Also note that as an alternative sapply(...)
could be replaced with model.matrix(~ Player + 0)
in which case the column names would be slightly different.)
levs <- levels(DF2$Player)
transform(DF2, Avg = ifelse(sapply(levs, `==`, Player), 1, NA) * AVG)
giving:
Player Goals AVG Avg.J Avg.O Avg.S
1 S 5 NA NA NA NA
2 S 2 5.000000 NA NA 5.000000
3 S 7 3.500000 NA NA 3.500000
4 O 3 NA NA NA NA
5 O 9 3.000000 NA 3.000000 NA
6 O 6 6.000000 NA 6.000000 NA
7 O 3 6.000000 NA 6.000000 NA
8 S 7 4.666667 NA NA 4.666667
9 O 1 5.250000 NA 5.250000 NA
10 S 7 5.250000 NA NA 5.250000
11 S 3 5.600000 NA NA 5.600000
12 O 8 4.400000 NA 4.400000 NA
13 S 3 5.166667 NA NA 5.166667
14 O 4 5.000000 NA 5.000000 NA
15 O 1 4.857143 NA 4.857143 NA
16 S 9 4.857143 NA NA 4.857143
17 S 4 5.375000 NA NA 5.375000
18 O 6 4.375000 NA 4.375000 NA
19 S 3 5.222222 NA NA 5.222222
20 O 8 4.555556 NA 4.555556 NA
21 S 3 5.000000 NA NA 5.000000
22 O 4 4.900000 NA 4.900000 NA
23 O 1 4.818182 NA 4.818182 NA
24 S 9 4.818182 NA NA 4.818182
25 S 4 5.166667 NA NA 5.166667
26 O 6 4.500000 NA 4.500000 NA
27 J 6 NA NA NA NA
Note: This was used as input above:
DF2 <- structure(list(Player = structure(c(3L, 3L, 3L, 2L, 2L, 2L, 2L,
3L, 2L, 3L, 3L, 2L, 3L, 2L, 2L, 3L, 3L, 2L, 3L, 2L, 3L, 2L, 2L,
3L, 3L, 2L, 1L), .Label = c("J", "O", "S"), class = "factor"),
Goals = c(5, 2, 7, 3, 9, 6, 3, 7, 1, 7, 3, 8, 3, 4, 1, 9,
4, 6, 3, 8, 3, 4, 1, 9, 4, 6, 6), AVG = c(NA, 5, 3.5, NA,
3, 6, 6, 4.66666666666667, 5.25, 5.25, 5.6, 4.4, 5.16666666666667,
5, 4.85714285714286, 4.85714285714286, 5.375, 4.375, 5.22222222222222,
4.55555555555556, 5, 4.9, 4.81818181818182, 4.81818181818182,
5.16666666666667, 4.5, NA)), .Names = c("Player", "Goals",
"AVG"), row.names = c(NA, -27L), class = "data.frame")
Upvotes: 3
Reputation: 108553
Another approach is simply using indices. First make a function cummean
(which is trivial...):
cummean <- function(x){
cumsum(x) / seq_along(x)
}
Then calculate the cumulative means and store in a list (simplify = FALSE
) :
avgs <- with(mydf,
tapply(Goals,Player,cummean,
simplify = FALSE))
Finally, create the variables based on the player names, conveniently added as names of the list returned by tapply
. I specifically use a for
loop to avoid having to rebuild the complete data frame every time. Using indices, I can fill up the data frame in a quite more efficient way and still have that lag you want. :
for(i in names(avgs)){
theavg <- avgs[[i]]
mydf[[i]][mydf$Player == i] <- c(NA, theavg[-length(theavg)])
}
Upvotes: 2