CoolGuyHasChillDay
CoolGuyHasChillDay

Reputation: 747

How to mimick ROW_NUMBER() OVER(...) in R

To manipulate/summarize data over time, I usually use SQL ROW_NUMBER() OVER(PARTITION by ...). I'm new to R, so I'm trying to recreate tables I otherwise would create in SQL. The package sqldf does not allow OVER clauses. Example table:

ID   Day    Person    Cost
1     1        A        50     
2     1        B        25     
3     2        A        30     
4     3        B        75     
5     4        A        35     
6     4        B        100    
7     6        B        65     
8     7        A        20     

I want my final table to include the average of the previous 2 instances for each day after their 2nd instance (day 4 for both):

ID   Day    Person    Cost     Prev2
5     4        A        35     40
6     4        B        100    50
7     6        B        65     90
8     7        A        20     35

I've been trying to play around with aggregate, but I'm not really sure how to partition or qualify the function. Ideally, I'd prefer not to use the fact that id is sequential with the date to form my answer (i.e. original table could be rearranged with random date order and code would still work). Let me know if you need more details, thanks for your help!

Upvotes: 1

Views: 1733

Answers (1)

alistaire
alistaire

Reputation: 43344

You could lag zoo::rollapplyr with a width of 2. In dplyr,

library(dplyr)

df %>% arrange(Day) %>%    # sort
    group_by(Person) %>%    # set grouping
    mutate(Prev2 = lag(zoo::rollapplyr(Cost, width = 2, FUN = mean, fill = NA)))
#> Source: local data frame [8 x 5]
#> Groups: Person [2]
#> 
#>      ID   Day Person  Cost Prev2
#>   <int> <int> <fctr> <int> <dbl>
#> 1     1     1      A    50    NA
#> 2     2     1      B    25    NA
#> 3     3     2      A    30    NA
#> 4     4     3      B    75    NA
#> 5     5     4      A    35  40.0
#> 6     6     4      B   100  50.0
#> 7     7     6      B    65  87.5
#> 8     8     7      A    20  32.5

or all in dplyr,

df %>% arrange(Day) %>% group_by(Person) %>% mutate(Prev2 = (lag(Cost) + lag(Cost, 2)) / 2)

which returns the same thing. In base,

df <- df[order(df$Day), ]

df$Prev2 <- ave(df$Cost, df$Person, FUN = function(x){
    c(NA, zoo::rollapplyr(x, width = 2, FUN = mean, fill = NA)[-length(x)])
})

df
#>   ID Day Person Cost Prev2
#> 1  1   1      A   50    NA
#> 2  2   1      B   25    NA
#> 3  3   2      A   30    NA
#> 4  4   3      B   75    NA
#> 5  5   4      A   35  40.0
#> 6  6   4      B  100  50.0
#> 7  7   6      B   65  87.5
#> 8  8   7      A   20  32.5

or without zoo,

df$Prev2 <- ave(df$Cost, df$Person, FUN = function(x){
    (c(NA, x[-length(x)]) + c(NA, NA, x[-(length(x) - 1):-length(x)])) / 2
})

which does the same thing. If you want to remove the NA rows, tack on tidyr::drop_na(Prev2) or na.omit.

Upvotes: 2

Related Questions