hrkshr
hrkshr

Reputation: 129

Mutate a variable with a few conditions

I have a dataset and want to create a column called stock. In particular, the rule is the following.

  1. If an agent won the game in the previous month, they have a stock based on the quantity in the previous month.
  2. This stock vanishes after three months (e.g., if an agent won on 2020-02-01, this stock disappears when they participate in a game on 2020-06-01.
  3. There are many agents A, B, etc.

How do I create such a column using tibble?

Case 1: Simple version

date = c("2020-01-01", "2020-02-01", "2020-03-01","2020-04-01", "2020-05-01", "2020-06-01", "2020-07-01", "2020-08-01", "2020-01-01", "2020-02-01", "2020-03-01", "2020-04-01", "2020-05-01", "2020-08-01")
id = c("A", "A", "A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B")
win = c(0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0)
quantity = c(60, 50, 50, 100, 10, 10, 100, 100, 60, 50, 50, 10, 10, 100)
dat <- tibble(date = as.Date(date), id = id, win = win, quantity = quantity)

date       id    win   quantity stock
<date>     <chr> <dbl> <dbl>   <dbl>
2020-01-01 A     0     60       0
2020-02-01 A     1     50       0
2020-03-01 A     0     50       50 ## have 50 for 3 months because A win in the previous month
2020-04-01 A     1     100      50 
2020-05-01 A     0     10      150 ## have 100 for 3 months because A win in the previous month
2020-06-01 A     0     10      100 ## disappear 50 after 3 months
2020-07-01 A     0     100     100 ## disappear 50 after 3 months
2020-08-01 A     0     100      0  ## disappear 100 after 3 months
2020-01-01 B     0     60       0
2020-02-01 B     0     50       0
2020-03-01 B     0     50       0  
2020-04-01 B     1     10       0 
2020-05-01 B     0     10       10 
2020-08-01 B     0     100      0

Case2: Real data

gameid = c("A1", "A2", "A3", "A4", "A5", "A6", "A7", "A8", "A9", "B1", "B2", "B3", "B4", "B5", "B6")
date = c("2020-01-01", "2020-02-01", "2020-03-01","2020-04-01", "2020-05-01", "2020-06-01", "2020-06-01", "2020-07-01", "2020-08-01", "2020-01-01", "2020-02-01", "2020-03-01", "2020-04-01", "2020-05-01", "2020-08-01")
id = c("A", "A", "A", "A", "A", "A", "A", "A","A", "B", "B", "B", "B", "B", "B")
win = c(0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0)
quantity = c(60, 50, 50, 100, 10, NA, NA, 100, 100, 60, 50, 50, 10, 10, 100)
dat = tibble(gameid = gameid, date = as.Date(date), id = id, win = win, quantity = quantity)

> dat
# A tibble: 15 × 5
   gameid date       id      win quantity
   <chr>  <date>     <chr> <dbl>    <dbl>
 1 A1     2020-01-01 A         0       60
 2 A2     2020-02-01 A         1       50
 3 A3     2020-03-01 A         0       50
 4 A4     2020-04-01 A         1      100
 5 A5     2020-05-01 A         0       10
 6 A6     2020-06-01 A         0       NA
 7 A7     2020-06-01 A         0       NA # The Case 1 cannot distinguish this part.
 8 A8     2020-07-01 A         1      100
 9 A9     2020-08-01 A         0      100
10 B1     2020-01-01 B         0       60
11 B2     2020-02-01 B         0       50
12 B3     2020-03-01 B         0       50
13 B4     2020-04-01 B         1       10
14 B5     2020-05-01 B         0       10
15 B6     2020-08-01 B         0      100

Upvotes: 2

Views: 125

Answers (1)

r2evans
r2evans

Reputation: 160407

Complete rewrite, look in the history for the previous (incomplete/incorrect) answer.

While this looks like a rolling join, the fact that you want to consider the gaps means it's a range-based or non-equi join. For this, we'll use one of these packages: data.table, fuzzyjoin (dplyr-style), or sqldf. We'll also use lubridate to facilitate "3 months ago".

data.table

library(lubridate)
library(data.table)
datDT <- as.data.table(dat) # should use setDT(dat) if you're really going this route

datDT[, .(id, fromdate = date, todate = date %m+% months(3),
          w2 = win, q2 = quantity)
      ][datDT, on = .(id, fromdate < date, todate >= date)
        ][, .(stock = sum(c(0, q2[w2 > 0]))), by = .(gameid, date = fromdate, id, win, quantity)
          ][ is.na(stock), stock := 0 ][]
#     gameid       date id win quantity stock
#  1:     A1 2020-01-01  A   0       60     0
#  2:     A2 2020-02-01  A   1       50     0
#  3:     A3 2020-03-01  A   0       50    50
#  4:     A4 2020-04-01  A   1      100    50
#  5:     A5 2020-05-01  A   0       10   150
#  6:     A6 2020-06-01  A   0       NA   100
#  7:     A7 2020-06-01  A   0       NA   100
#  8:     A8 2020-07-01  A   1      100   100
#  9:     A9 2020-08-01  A   0      100   100
# 10:     B1 2020-01-01  B   0       60     0
# 11:     B2 2020-02-01  B   0       50     0
# 12:     B3 2020-03-01  B   0       50     0
# 13:     B4 2020-04-01  B   1       10     0
# 14:     B5 2020-05-01  B   0       10    10
# 15:     B6 2020-08-01  B   0      100     0

fuzzyjoin

library(dplyr)
# library(fuzzyjoin) # fuzzy_left_join
library(lubridate)

fuzzyjoin::fuzzy_left_join(
  dat, transform(dat, todate = date %m+% months(3)),
  by = c(id = "id", date = "date", date = "todate"),
  match_fun = list(`==`, `>`, `<=`)) %>%
  group_by(gameid = gameid.x, date = date.x, id = id.x, win = win.x, quantity = quantity.x) %>%
  summarize(stock = sum(quantity.y[win.y > 0]), .groups = "drop") %>%
  mutate(stock = coalesce(stock, 0)) %>%
  arrange(id, date)
#  # A tibble: 15 x 6
#    gameid date       id      win quantity stock
#    <chr>  <date>     <chr> <dbl>    <dbl> <dbl>
#  1 A1     2020-01-01 A         0       60     0
#  2 A2     2020-02-01 A         1       50     0
#  3 A3     2020-03-01 A         0       50    50
#  4 A4     2020-04-01 A         1      100    50
#  5 A5     2020-05-01 A         0       10   150
#  6 A6     2020-06-01 A         0       NA   100
#  7 A7     2020-06-01 A         0       NA   100
#  8 A8     2020-07-01 A         1      100   100
#  9 A9     2020-08-01 A         0      100   100
# 10 B1     2020-01-01 B         0       60     0
# 11 B2     2020-02-01 B         0       50     0
# 12 B3     2020-03-01 B         0       50     0
# 13 B4     2020-04-01 B         1       10     0
# 14 B5     2020-05-01 B         0       10    10
# 15 B6     2020-08-01 B         0      100     0\

sqldf

This is using a SQLite backend, but most (all?) of sqldf's backends should work with minimal change. We pre-calculate the todate into the original frame here.

library(lubridate)
sqldf::sqldf("
  select t1.gameid, t1.date, t1.id, t1.win, t1.quantity,
    sum(case when t2.win > 0 then t2.quantity else 0 end) as stock
  from dat t1
    left join dat t2 on t1.id = t2.id
      and t2.todate between t1.date and (t1.todate - 1)
  group by t1.gameid, t1.date, t1.id, t1.win, t1.quantity
  order by t1.id, t1.date")
#    gameid       date id win quantity stock
# 1      A1 2020-01-01  A   0       60     0
# 2      A2 2020-02-01  A   1       50     0
# 3      A3 2020-03-01  A   0       50    50
# 4      A4 2020-04-01  A   1      100    50
# 5      A5 2020-05-01  A   0       10   150
# 6      A6 2020-06-01  A   0       NA   100
# 7      A7 2020-06-01  A   0       NA   100
# 8      A8 2020-07-01  A   1      100   100
# 9      A9 2020-08-01  A   0      100   100
# 10     B1 2020-01-01  B   0       60     0
# 11     B2 2020-02-01  B   0       50     0
# 12     B3 2020-03-01  B   0       50     0
# 13     B4 2020-04-01  B   1       10     0
# 14     B5 2020-05-01  B   0       10    10
# 15     B6 2020-08-01  B   0      100     0

Upvotes: 2

Related Questions