Reputation: 129
I have a dataset and want to create a column called stock
. In particular, the rule is the following.
stock
based on the quantity in the previous month.stock
vanishes after three months (e.g., if an agent won on 2020-02-01
, this stock disappears when they participate in a game on 2020-06-01
.A
, B
, etc.How do I create such a column using tibble
?
Case 1: Simple version
date = c("2020-01-01", "2020-02-01", "2020-03-01","2020-04-01", "2020-05-01", "2020-06-01", "2020-07-01", "2020-08-01", "2020-01-01", "2020-02-01", "2020-03-01", "2020-04-01", "2020-05-01", "2020-08-01")
id = c("A", "A", "A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B")
win = c(0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0)
quantity = c(60, 50, 50, 100, 10, 10, 100, 100, 60, 50, 50, 10, 10, 100)
dat <- tibble(date = as.Date(date), id = id, win = win, quantity = quantity)
date id win quantity stock
<date> <chr> <dbl> <dbl> <dbl>
2020-01-01 A 0 60 0
2020-02-01 A 1 50 0
2020-03-01 A 0 50 50 ## have 50 for 3 months because A win in the previous month
2020-04-01 A 1 100 50
2020-05-01 A 0 10 150 ## have 100 for 3 months because A win in the previous month
2020-06-01 A 0 10 100 ## disappear 50 after 3 months
2020-07-01 A 0 100 100 ## disappear 50 after 3 months
2020-08-01 A 0 100 0 ## disappear 100 after 3 months
2020-01-01 B 0 60 0
2020-02-01 B 0 50 0
2020-03-01 B 0 50 0
2020-04-01 B 1 10 0
2020-05-01 B 0 10 10
2020-08-01 B 0 100 0
Case2: Real data
gameid = c("A1", "A2", "A3", "A4", "A5", "A6", "A7", "A8", "A9", "B1", "B2", "B3", "B4", "B5", "B6")
date = c("2020-01-01", "2020-02-01", "2020-03-01","2020-04-01", "2020-05-01", "2020-06-01", "2020-06-01", "2020-07-01", "2020-08-01", "2020-01-01", "2020-02-01", "2020-03-01", "2020-04-01", "2020-05-01", "2020-08-01")
id = c("A", "A", "A", "A", "A", "A", "A", "A","A", "B", "B", "B", "B", "B", "B")
win = c(0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0)
quantity = c(60, 50, 50, 100, 10, NA, NA, 100, 100, 60, 50, 50, 10, 10, 100)
dat = tibble(gameid = gameid, date = as.Date(date), id = id, win = win, quantity = quantity)
> dat
# A tibble: 15 × 5
gameid date id win quantity
<chr> <date> <chr> <dbl> <dbl>
1 A1 2020-01-01 A 0 60
2 A2 2020-02-01 A 1 50
3 A3 2020-03-01 A 0 50
4 A4 2020-04-01 A 1 100
5 A5 2020-05-01 A 0 10
6 A6 2020-06-01 A 0 NA
7 A7 2020-06-01 A 0 NA # The Case 1 cannot distinguish this part.
8 A8 2020-07-01 A 1 100
9 A9 2020-08-01 A 0 100
10 B1 2020-01-01 B 0 60
11 B2 2020-02-01 B 0 50
12 B3 2020-03-01 B 0 50
13 B4 2020-04-01 B 1 10
14 B5 2020-05-01 B 0 10
15 B6 2020-08-01 B 0 100
Upvotes: 2
Views: 125
Reputation: 160407
Complete rewrite, look in the history for the previous (incomplete/incorrect) answer.
While this looks like a rolling join, the fact that you want to consider the gaps means it's a range-based or non-equi join. For this, we'll use one of these packages: data.table
, fuzzyjoin
(dplyr-style), or sqldf
. We'll also use lubridate
to facilitate "3 months ago".
library(lubridate)
library(data.table)
datDT <- as.data.table(dat) # should use setDT(dat) if you're really going this route
datDT[, .(id, fromdate = date, todate = date %m+% months(3),
w2 = win, q2 = quantity)
][datDT, on = .(id, fromdate < date, todate >= date)
][, .(stock = sum(c(0, q2[w2 > 0]))), by = .(gameid, date = fromdate, id, win, quantity)
][ is.na(stock), stock := 0 ][]
# gameid date id win quantity stock
# 1: A1 2020-01-01 A 0 60 0
# 2: A2 2020-02-01 A 1 50 0
# 3: A3 2020-03-01 A 0 50 50
# 4: A4 2020-04-01 A 1 100 50
# 5: A5 2020-05-01 A 0 10 150
# 6: A6 2020-06-01 A 0 NA 100
# 7: A7 2020-06-01 A 0 NA 100
# 8: A8 2020-07-01 A 1 100 100
# 9: A9 2020-08-01 A 0 100 100
# 10: B1 2020-01-01 B 0 60 0
# 11: B2 2020-02-01 B 0 50 0
# 12: B3 2020-03-01 B 0 50 0
# 13: B4 2020-04-01 B 1 10 0
# 14: B5 2020-05-01 B 0 10 10
# 15: B6 2020-08-01 B 0 100 0
library(dplyr)
# library(fuzzyjoin) # fuzzy_left_join
library(lubridate)
fuzzyjoin::fuzzy_left_join(
dat, transform(dat, todate = date %m+% months(3)),
by = c(id = "id", date = "date", date = "todate"),
match_fun = list(`==`, `>`, `<=`)) %>%
group_by(gameid = gameid.x, date = date.x, id = id.x, win = win.x, quantity = quantity.x) %>%
summarize(stock = sum(quantity.y[win.y > 0]), .groups = "drop") %>%
mutate(stock = coalesce(stock, 0)) %>%
arrange(id, date)
# # A tibble: 15 x 6
# gameid date id win quantity stock
# <chr> <date> <chr> <dbl> <dbl> <dbl>
# 1 A1 2020-01-01 A 0 60 0
# 2 A2 2020-02-01 A 1 50 0
# 3 A3 2020-03-01 A 0 50 50
# 4 A4 2020-04-01 A 1 100 50
# 5 A5 2020-05-01 A 0 10 150
# 6 A6 2020-06-01 A 0 NA 100
# 7 A7 2020-06-01 A 0 NA 100
# 8 A8 2020-07-01 A 1 100 100
# 9 A9 2020-08-01 A 0 100 100
# 10 B1 2020-01-01 B 0 60 0
# 11 B2 2020-02-01 B 0 50 0
# 12 B3 2020-03-01 B 0 50 0
# 13 B4 2020-04-01 B 1 10 0
# 14 B5 2020-05-01 B 0 10 10
# 15 B6 2020-08-01 B 0 100 0\
This is using a SQLite backend, but most (all?) of sqldf
's backends should work with minimal change. We pre-calculate the todate
into the original frame here.
library(lubridate)
sqldf::sqldf("
select t1.gameid, t1.date, t1.id, t1.win, t1.quantity,
sum(case when t2.win > 0 then t2.quantity else 0 end) as stock
from dat t1
left join dat t2 on t1.id = t2.id
and t2.todate between t1.date and (t1.todate - 1)
group by t1.gameid, t1.date, t1.id, t1.win, t1.quantity
order by t1.id, t1.date")
# gameid date id win quantity stock
# 1 A1 2020-01-01 A 0 60 0
# 2 A2 2020-02-01 A 1 50 0
# 3 A3 2020-03-01 A 0 50 50
# 4 A4 2020-04-01 A 1 100 50
# 5 A5 2020-05-01 A 0 10 150
# 6 A6 2020-06-01 A 0 NA 100
# 7 A7 2020-06-01 A 0 NA 100
# 8 A8 2020-07-01 A 1 100 100
# 9 A9 2020-08-01 A 0 100 100
# 10 B1 2020-01-01 B 0 60 0
# 11 B2 2020-02-01 B 0 50 0
# 12 B3 2020-03-01 B 0 50 0
# 13 B4 2020-04-01 B 1 10 0
# 14 B5 2020-05-01 B 0 10 10
# 15 B6 2020-08-01 B 0 100 0
Upvotes: 2