Reputation: 99
I have some football data (2020/2021 Serie A), and I would like to compute the number of games each team played over the last n days (lets say to simplify over the last 30 days). Conditions are thus the team, the day the game is played (strictly smaller than) and that same day - 30 (greater or equal to).
I would like to know what is the best way to do that using data.table (alone), and more important, the logic behind the code. I woudl go for a loop over the teams and dates, but I think it is cumbersome and I am sure there is a way to have it done in one single row.
A sample is given below, with the result I would expect (day and date might seem misleading, because some games were postponed, but that is not important. Data is classified by date). Thank you very much.
Code | Team | Date | Day | Date - 30d | Games played over the last 30 days |
---|---|---|---|---|---|
TORATA | Atalanta | 2020-09-26 | 2 | 2020-08-27 | NA |
LAZATA | Atalanta | 2020-09-30 | 1 | 2020-08-31 | 1 |
ATACAG | Atalanta | 2020-10-04 | 3 | 2020-09-04 | 2 |
NAPATA | Atalanta | 2020-10-17 | 4 | 2020-09-17 | 3 |
ATASAM | Atalanta | 2020-10-24 | 5 | 2020-09-24 | 4 |
CROATA | Atalanta | 2020-10-31 | 6 | 2020-10-01 | 3 |
ATAINT | Atalanta | 2020-11-08 | 7 | 2020-10-09 | 3 |
Upvotes: 0
Views: 1244
Reputation: 160397
Here's one implementation, just data.table
and base R:
dat[, z := sapply(Date, function(z) sum(between(z - Date, 0.1, 30)))]
dat
# Code Team Date Day Date...30d Games.played.over.the.last.30.days z
# <char> <char> <Date> <int> <Date> <int> <int>
# 1: TORATA Atalanta 2020-09-26 2 2020-08-27 NA 0
# 2: LAZATA Atalanta 2020-09-30 1 2020-08-31 1 1
# 3: ATACAG Atalanta 2020-10-04 3 2020-09-04 2 2
# 4: NAPATA Atalanta 2020-10-17 4 2020-09-17 3 3
# 5: ATASAM Atalanta 2020-10-24 5 2020-09-24 4 4
# 6: CROATA Atalanta 2020-10-31 6 2020-10-01 3 3
# 7: ATAINT Atalanta 2020-11-08 7 2020-10-09 3 3
In this case, for each Date
value, we count how many of the dates are within 30 days of it.
If you need the NA
in place of a 0
, then you can add on dat[z < 1, z := NA]
or similar.
Data:
library(data.table)
dat <- structure(list(Code = c("TORATA", "LAZATA", "ATACAG", "NAPATA", "ATASAM", "CROATA", "ATAINT"), Team = c("Atalanta", "Atalanta", "Atalanta", "Atalanta", "Atalanta", "Atalanta", "Atalanta"), Date = structure(c(18531, 18535, 18539, 18552, 18559, 18566, 18574), class = "Date"), Day = c(2L, 1L, 3L, 4L, 5L, 6L, 7L), Date...30d = structure(c(18501, 18505, 18509, 18522, 18529, 18536, 18544), class = "Date"), Games.played.over.the.last.30.days = c(NA, 1L, 2L, 3L, 4L, 3L, 3L)), class = c("data.table", "data.frame"), row.names = c(NA, -7L))
setDT(dat)
Upvotes: 1
Reputation: 24722
You can get this with one line of code, using a non-equi join of the table onto itself.
Let's say fb
is your input data (without the Games30days
column). Like this:
Code Team Date Day Date - 30d
1: TORATA Atalanta 2020-09-26 2 2020-08-27
2: LAZATA Atalanta 2020-09-30 1 2020-08-31
3: ATACAG Atalanta 2020-10-04 3 2020-09-04
4: NAPATA Atalanta 2020-10-17 4 2020-09-17
5: ATASAM Atalanta 2020-10-24 5 2020-09-24
6: CROATA Atalanta 2020-10-31 6 2020-10-01
7: ATAINT Atalanta 2020-11-08 7 2020-10-09
Then, just do a join on Team=Team
, Date<Date
, and Date>Date - 30d
, like this:
games_played = fb[fb,on=.(Team = Team, Date<Date, Date>`Date - 30d`), nomatch=0][,.("Games30" = .N), .(Date,Team)]
which returns
Date Team Games30
1: 2020-09-30 Atalanta 1
2: 2020-10-04 Atalanta 2
3: 2020-10-17 Atalanta 3
4: 2020-10-24 Atalanta 4
5: 2020-10-31 Atalanta 3
6: 2020-11-08 Atalanta 3
That result can easily be joined back to the original, to get all the columns, like this:
games_played[fb, on=.(Team, Date)]
Upvotes: 0
Reputation: 41210
You could use runner
in combination with data.table
to calculate a running Date
window count:
library(data.table)
library(runner)
setDT(data)
data[,Date:=as.Date(Date,'%Y-%m-%d')]
data[,N:=runner::runner(
x = Date,
k = 30, # 30-days window
lag = 1,
idx = Date,
f = length)
,by=Team][]
Code Team Date Day Date30d Games30days N
1: TORATA Atalanta 2020-09-26 2 2020-08-27 NA 0
2: LAZATA Atalanta 2020-09-30 1 2020-08-31 1 1
3: ATACAG Atalanta 2020-10-04 3 2020-09-04 2 2
4: NAPATA Atalanta 2020-10-17 4 2020-09-17 3 3
5: ATASAM Atalanta 2020-10-24 5 2020-09-24 4 4
6: CROATA Atalanta 2020-10-31 6 2020-10-01 3 3
7: ATAINT Atalanta 2020-11-08 7 2020-10-09 3 3
Data:
data <- read.table(text='
Code Team Date Day Date30d Games30days
TORATA Atalanta 2020-09-26 2 2020-08-27 NA
LAZATA Atalanta 2020-09-30 1 2020-08-31 1
ATACAG Atalanta 2020-10-04 3 2020-09-04 2
NAPATA Atalanta 2020-10-17 4 2020-09-17 3
ATASAM Atalanta 2020-10-24 5 2020-09-24 4
CROATA Atalanta 2020-10-31 6 2020-10-01 3
ATAINT Atalanta 2020-11-08 7 2020-10-09 3',header=T)
Upvotes: 0