greeny
greeny

Reputation: 445

Cumulative Mean with Conditions

This is a follow on question from Cumulative Mean with Grouping and Lag and Grouped moving average in r.

I'm looking to create a cumulative mean field with a lag of one that groups over multiple variables but is only calculating the average on certain criteria. So for the example below, S-AVG only gives the cumulative mean for S and vice versa for O-AVG and J-AVG. I'm sure this is possible using ave and cumsum, but am unsure how to do it.

Here is the desired output:

Player  Goals   **S-AVG**    **O-AVG**    **J-AVG**
S       5                         
S       2       5
S       7       3.5         
O       3                    
O       9                     3
O       6                     6      
O       3                     3 
S       7       4.66         
O       1                     5.25
S       7       5.25         
S       3       5.6         
Q       8                     4.4            
S       3       5.16                  
O       4                     5           
P       1                     4.857
S       9       4.857         
S       4       5.375         
Z       6                     4.375        
S       3       5.22         
O       8                     4.55            
S       3       5                  
O       4                     4.9      
O       1                     4.81      
S       9       4.81                  
S       4       5.16                  
O       6                     4.5       
J       6                     

Here is the data input for r

Player <- c('S','S','S','O','O','O','O','S','O','S','S','O','S','O','O','S','S','O','S','O','S','O','O','S','S','O','J')
Goals <- c(5,2,7,3,9,6,3,7,1,7,3,8,3,4,1,9,4,6,3,8,3,4,1,9,4,6,6)
data.frame(Player, Goals)

Any help is appreciated.

Upvotes: 1

Views: 422

Answers (2)

G. Grothendieck
G. Grothendieck

Reputation: 269674

Suppose DF2 is the data frame computed in my answer to the prior post referenced in the question, i.e. the data frame having the AVG column. It is also reproduced in the Note at the end of this answer.

If we only had one or a small fixed number of players we could do this by writing out an AVG.* column for each player (shown for one player):

transform(DF2, AVG.S = ifelse(Player == "S", AVG, NA))

but a more general approach follows. Set levs to the levels of the Player factor or if you don't want all players then levs should be set to a character vector of just the players you want. Then use sapply to contruct a logical matrix and convert that to a matrix of 1s and NAs which is then scalar multiplied by AVG.

The solution has a number of desirable features -- it does not overwrite its input (which would be error prone) and it avoids needless repeated qualification (both thanks to transform), it uses the whole object approach rather than loops and subscripting, it leverages existing code avoiding duplication (by using the result of the prior solution that this question is a follow-on to) and is brief -- two lines of code. It uses no packages.

(Also note that as an alternative sapply(...) could be replaced with model.matrix(~ Player + 0) in which case the column names would be slightly different.)

levs <- levels(DF2$Player)
transform(DF2, Avg = ifelse(sapply(levs, `==`, Player), 1, NA) * AVG)

giving:

   Player Goals      AVG Avg.J    Avg.O    Avg.S
1       S     5       NA    NA       NA       NA
2       S     2 5.000000    NA       NA 5.000000
3       S     7 3.500000    NA       NA 3.500000
4       O     3       NA    NA       NA       NA
5       O     9 3.000000    NA 3.000000       NA
6       O     6 6.000000    NA 6.000000       NA
7       O     3 6.000000    NA 6.000000       NA
8       S     7 4.666667    NA       NA 4.666667
9       O     1 5.250000    NA 5.250000       NA
10      S     7 5.250000    NA       NA 5.250000
11      S     3 5.600000    NA       NA 5.600000
12      O     8 4.400000    NA 4.400000       NA
13      S     3 5.166667    NA       NA 5.166667
14      O     4 5.000000    NA 5.000000       NA
15      O     1 4.857143    NA 4.857143       NA
16      S     9 4.857143    NA       NA 4.857143
17      S     4 5.375000    NA       NA 5.375000
18      O     6 4.375000    NA 4.375000       NA
19      S     3 5.222222    NA       NA 5.222222
20      O     8 4.555556    NA 4.555556       NA
21      S     3 5.000000    NA       NA 5.000000
22      O     4 4.900000    NA 4.900000       NA
23      O     1 4.818182    NA 4.818182       NA
24      S     9 4.818182    NA       NA 4.818182
25      S     4 5.166667    NA       NA 5.166667
26      O     6 4.500000    NA 4.500000       NA
27      J     6       NA    NA       NA       NA

Note: This was used as input above:

DF2 <- structure(list(Player = structure(c(3L, 3L, 3L, 2L, 2L, 2L, 2L, 
3L, 2L, 3L, 3L, 2L, 3L, 2L, 2L, 3L, 3L, 2L, 3L, 2L, 3L, 2L, 2L, 
3L, 3L, 2L, 1L), .Label = c("J", "O", "S"), class = "factor"), 
    Goals = c(5, 2, 7, 3, 9, 6, 3, 7, 1, 7, 3, 8, 3, 4, 1, 9, 
    4, 6, 3, 8, 3, 4, 1, 9, 4, 6, 6), AVG = c(NA, 5, 3.5, NA, 
    3, 6, 6, 4.66666666666667, 5.25, 5.25, 5.6, 4.4, 5.16666666666667, 
    5, 4.85714285714286, 4.85714285714286, 5.375, 4.375, 5.22222222222222, 
    4.55555555555556, 5, 4.9, 4.81818181818182, 4.81818181818182, 
    5.16666666666667, 4.5, NA)), .Names = c("Player", "Goals", 
"AVG"), row.names = c(NA, -27L), class = "data.frame")

Upvotes: 3

Joris Meys
Joris Meys

Reputation: 108553

Another approach is simply using indices. First make a function cummean (which is trivial...):

cummean <- function(x){
  cumsum(x) / seq_along(x)
}

Then calculate the cumulative means and store in a list (simplify = FALSE ) :

avgs <- with(mydf,
             tapply(Goals,Player,cummean,
                    simplify = FALSE))

Finally, create the variables based on the player names, conveniently added as names of the list returned by tapply. I specifically use a for loop to avoid having to rebuild the complete data frame every time. Using indices, I can fill up the data frame in a quite more efficient way and still have that lag you want. :

for(i in names(avgs)){
  theavg <- avgs[[i]]
  mydf[[i]][mydf$Player == i] <- c(NA, theavg[-length(theavg)])
}

Upvotes: 2

Related Questions