Didifr
Didifr

Reputation: 99

Cumulative Sum in R based on Date and other conditions using data.table

I have some football data (2020/2021 Serie A), and I would like to compute the number of games each team played over the last n days (lets say to simplify over the last 30 days). Conditions are thus the team, the day the game is played (strictly smaller than) and that same day - 30 (greater or equal to).

I would like to know what is the best way to do that using data.table (alone), and more important, the logic behind the code. I woudl go for a loop over the teams and dates, but I think it is cumbersome and I am sure there is a way to have it done in one single row.

A sample is given below, with the result I would expect (day and date might seem misleading, because some games were postponed, but that is not important. Data is classified by date). Thank you very much.

Code Team Date Day Date - 30d Games played over the last 30 days
TORATA Atalanta 2020-09-26 2 2020-08-27 NA
LAZATA Atalanta 2020-09-30 1 2020-08-31 1
ATACAG Atalanta 2020-10-04 3 2020-09-04 2
NAPATA Atalanta 2020-10-17 4 2020-09-17 3
ATASAM Atalanta 2020-10-24 5 2020-09-24 4
CROATA Atalanta 2020-10-31 6 2020-10-01 3
ATAINT Atalanta 2020-11-08 7 2020-10-09 3

Upvotes: 0

Views: 1244

Answers (3)

r2evans
r2evans

Reputation: 160397

Here's one implementation, just data.table and base R:

dat[, z := sapply(Date, function(z) sum(between(z - Date, 0.1, 30)))]
dat
#      Code     Team       Date   Day Date...30d Games.played.over.the.last.30.days     z
#    <char>   <char>     <Date> <int>     <Date>                              <int> <int>
# 1: TORATA Atalanta 2020-09-26     2 2020-08-27                                 NA     0
# 2: LAZATA Atalanta 2020-09-30     1 2020-08-31                                  1     1
# 3: ATACAG Atalanta 2020-10-04     3 2020-09-04                                  2     2
# 4: NAPATA Atalanta 2020-10-17     4 2020-09-17                                  3     3
# 5: ATASAM Atalanta 2020-10-24     5 2020-09-24                                  4     4
# 6: CROATA Atalanta 2020-10-31     6 2020-10-01                                  3     3
# 7: ATAINT Atalanta 2020-11-08     7 2020-10-09                                  3     3

In this case, for each Date value, we count how many of the dates are within 30 days of it.

If you need the NA in place of a 0, then you can add on dat[z < 1, z := NA] or similar.


Data:

library(data.table)
dat <- structure(list(Code = c("TORATA", "LAZATA", "ATACAG", "NAPATA", "ATASAM", "CROATA", "ATAINT"), Team = c("Atalanta", "Atalanta", "Atalanta", "Atalanta", "Atalanta", "Atalanta", "Atalanta"), Date = structure(c(18531, 18535, 18539, 18552, 18559, 18566, 18574), class = "Date"), Day = c(2L, 1L, 3L, 4L, 5L, 6L, 7L), Date...30d = structure(c(18501, 18505, 18509, 18522, 18529, 18536, 18544), class = "Date"), Games.played.over.the.last.30.days = c(NA, 1L, 2L, 3L, 4L, 3L, 3L)), class = c("data.table", "data.frame"), row.names = c(NA, -7L))
setDT(dat)

Upvotes: 1

langtang
langtang

Reputation: 24722

You can get this with one line of code, using a non-equi join of the table onto itself.

Let's say fb is your input data (without the Games30days column). Like this:

     Code     Team       Date Day Date - 30d
1: TORATA Atalanta 2020-09-26   2 2020-08-27
2: LAZATA Atalanta 2020-09-30   1 2020-08-31
3: ATACAG Atalanta 2020-10-04   3 2020-09-04
4: NAPATA Atalanta 2020-10-17   4 2020-09-17
5: ATASAM Atalanta 2020-10-24   5 2020-09-24
6: CROATA Atalanta 2020-10-31   6 2020-10-01
7: ATAINT Atalanta 2020-11-08   7 2020-10-09

Then, just do a join on Team=Team, Date<Date, and Date>Date - 30d, like this:

games_played = fb[fb,on=.(Team = Team, Date<Date, Date>`Date - 30d`), nomatch=0][,.("Games30" =  .N), .(Date,Team)]

which returns

         Date     Team Games30
1: 2020-09-30 Atalanta       1
2: 2020-10-04 Atalanta       2
3: 2020-10-17 Atalanta       3
4: 2020-10-24 Atalanta       4
5: 2020-10-31 Atalanta       3
6: 2020-11-08 Atalanta       3

That result can easily be joined back to the original, to get all the columns, like this:

games_played[fb, on=.(Team, Date)]

Upvotes: 0

Waldi
Waldi

Reputation: 41210

You could use runner in combination with data.table to calculate a running Date window count:

library(data.table)
library(runner)

setDT(data)

data[,Date:=as.Date(Date,'%Y-%m-%d')]

data[,N:=runner::runner(
                        x = Date, 
                        k = 30, # 30-days window
                        lag = 1,
                        idx = Date,
                        f = length)
    ,by=Team][]

     Code     Team       Date Day    Date30d Games30days N
1: TORATA Atalanta 2020-09-26   2 2020-08-27          NA 0
2: LAZATA Atalanta 2020-09-30   1 2020-08-31           1 1
3: ATACAG Atalanta 2020-10-04   3 2020-09-04           2 2
4: NAPATA Atalanta 2020-10-17   4 2020-09-17           3 3
5: ATASAM Atalanta 2020-10-24   5 2020-09-24           4 4
6: CROATA Atalanta 2020-10-31   6 2020-10-01           3 3
7: ATAINT Atalanta 2020-11-08   7 2020-10-09           3 3

Data:

data <- read.table(text='
Code    Team    Date    Day     Date30d     Games30days
TORATA  Atalanta    2020-09-26  2   2020-08-27  NA
LAZATA  Atalanta    2020-09-30  1   2020-08-31  1
ATACAG  Atalanta    2020-10-04  3   2020-09-04  2
NAPATA  Atalanta    2020-10-17  4   2020-09-17  3
ATASAM  Atalanta    2020-10-24  5   2020-09-24  4
CROATA  Atalanta    2020-10-31  6   2020-10-01  3
ATAINT  Atalanta    2020-11-08  7   2020-10-09  3',header=T)

Upvotes: 0

Related Questions