user14328853
user14328853

Reputation: 434

Recursive lag() during mutate?

Alternate title might be 'Use lag within mutate to refer to previous rows mutation'

I'd like to include values generated for previous rows as inputs to a mutate calculation. Some data:

mydiamonds <- diamonds %>%
  mutate(Ideal = ifelse(cut == 'Ideal', 1, 0)) %>% 
  group_by(Ideal) %>% 
  mutate(rn = row_number()) %>% 
  arrange(Ideal, rn) %>% 
  mutate(CumPrice = cumsum(price)) %>% 
  mutate(InitialPrice = min(price)) %>% 
  select(Ideal, rn, CumPrice, InitialPrice)

Looks like this:

mydiamonds %>% head
# A tibble: 6 x 4
# Groups:   Ideal [1]
  Ideal    rn CumPrice InitialPrice
  <dbl> <int>    <int>        <int>
1     0     1      326          326
2     0     2      653          326
3     0     3      987          326
4     0     4     1322          326
5     0     5     1658          326
6     0     6     1994          326

A model:

mod.diamonds = glm(CumPrice ~ log(lag(CumPrice)) +log(rn) + Ideal , family = "poisson", data = mydiamonds)

Test the model:

# new data, pretend we don't know CumPrice but want to use predictions to predict subsequent predictions
mydiamonds.testdata <- mydiamonds %>% select(-CumPrice)
# manual prediction based on lag(prediction), for the first row in each group use InitialPrice
## add coefficients as fields
coeffs <- mod.diamonds$coefficients
mydiamonds.testdata <- mydiamonds.testdata %>% 
  mutate(CoefIntercept = coeffs['(Intercept)'],
         CoefLogLagCumPrice = coeffs['log(lag(CumPrice))'],
         CoefLogRn = coeffs['log(rn)'],
         CoefIdeal = coeffs['Ideal']
         )

Here's how my test data look:

 mydiamonds.testdata %>% head
# A tibble: 6 x 7
# Groups:   Ideal [1]
  Ideal    rn InitialPrice CoefIntercept CoefLogLagCumPrice CoefLogRn CoefIdeal
  <dbl> <int>        <int>         <dbl>              <dbl>     <dbl>     <dbl>
1     0     1          326        0.0931              0.987    0.0154 -0.000715
2     0     2          326        0.0931              0.987    0.0154 -0.000715
3     0     3          326        0.0931              0.987    0.0154 -0.000715
4     0     4          326        0.0931              0.987    0.0154 -0.000715
5     0     5          326        0.0931              0.987    0.0154 -0.000715
6     0     6          326        0.0931              0.987    0.0154 -0.000715

Cannot use predict(), since I need to recursively predict where predictions for the previous day/row are input to the current day. Instead try manual prediction using the coefficents:

# prediction
mydiamonds.testdata <- mydiamonds.testdata %>% 
  mutate(
    Prediction = CoefIntercept + 
      
      # here's the hard bit. If it's the first row in the group, use InitialPrice, else use the value of the previous prediction
      (CoefLogLagCumPrice * ifelse(rn == 1, InitialPrice, lag(Prediction))) + 
      
      (CoefLogRn * log(rn)) + 
      (CoefIdeal * Ideal)
    )

Error: Problem with mutate() input Prediction. x object 'Prediction' not found ℹ Input Prediction is +.... ℹ The error occurred in group 1: Ideal = 0.

How can I mutate in this way, where I'd like to refer to the previous rows mutation? (Unless it's the very first row, in which case use InitialPrice)

[EDIT] following commenter, I gave it a shot with accumulate, a function I'm not so familiar with:

mydiamonds.testdata <- mydiamonds.testdata %>% 
  mutate(
    Prediction = accumulate(.f = function(.) {
      
    .$CoefIntercept + 
      
      # here's the hard bit. If it's the first row in the group, use InitialPrice, else use the value of the previous prediction
      (.$CoefLogLagCumPrice * ifelse(.$rn == 1, .$InitialPrice, lag(.$Prediction))) + 
      
      (.$CoefLogRn * log(.$rn)) + 
      (.$CoefIdeal * .$Ideal)
      
      }))
Error: Problem with `mutate()` input `Prediction`.
x argument ".x" is missing, with no default
ℹ Input `Prediction` is `accumulate(...)`.
ℹ The error occurred in group 1: Ideal = 0.

Upvotes: 5

Views: 885

Answers (1)

Dan Chaltiel
Dan Chaltiel

Reputation: 8494

As you said you are not used to this rather complex function here is a bit of an explanation.

purrr::accumulate() is used to calculate row-wise recursive operations. Its first argument .x is the variable you want to accumulate on. Its second argument .f is a function that should have 2 arguments: the current result cur, and the next evaluated value val. The first time the .f is called, cur is equal to .x[1] (by default), then it is equal to the previous result returned by .f.

purrr::accumulate2() allows us to use a second variable .y to iterate on. The first value of .y is always ignored as .f already know what to return at this time. Therefore, .y should be one item shorter than .x.

Unfortunately, there is only accumulate() and accumulate2() where you would need accumulate3() or paccumulate() to accumulate on rn, Ideal and Price.

However, by using row_number() and cur_data(), you can trick accumulate2() to behave as you want:

CoefIntercept = coeffs['(Intercept)']
CoefLogLagCumPrice = coeffs['log(lag(CumPrice))']
CoefLogRn = coeffs['log(rn)']
CoefIdeal = coeffs['Ideal']

mydiamonds.testdata <- mydiamonds %>% 
  ungroup() %>% 
  select(-CumPrice) %>% 
  mutate(
    Prediction = accumulate2(.x=InitialPrice, .y=row_number()[-1], 
                             .f=function(acc, nxt, row) {
      db=cur_data_all()
      rn = db$rn[row]
      Ideal = db$Ideal[row]
      CoefIntercept +
        (CoefLogLagCumPrice * acc) +
        (CoefLogRn * log(rn)) +
        (CoefIdeal * Ideal)
      
    }) %>% unlist()
  )
mydiamonds.testdata

# A tibble: 53,940 x 4
#     Ideal    rn InitialPrice Prediction
#     <dbl> <int>        <int>      <dbl>
# 1       0     1          326       326 
# 2       0     2          326       322.
# 3       0     3          326       318.
# 4       0     4          326       313.
# 5       0     5          326       309.
# 6       0     6          326       305.
# 7       0     7          326       301.
# 8       0     8          326       297.
# 9       0     9          326       294.
# 10      0    10          326       290.

EDIT: There is another, cleaner way using the .init argument, as the InitialPrice column is never really used except for its first value. This allows to directly use the arguments, but it won't work for more complex models with more covariates.

mydiamonds.testdata <- mydiamonds %>% 
  ungroup() %>% 
  select(-CumPrice) %>% 
  mutate(
    Prediction = accumulate2(.x=Ideal[-1], .y=rn[-1], 
                             .init=InitialPrice[1],
                             .f=function(rslt, Ideal, rn) {
      CoefIntercept +
        (CoefLogLagCumPrice * rslt) +
        (CoefLogRn * log(rn)) +
        (CoefIdeal * Ideal)
      
    }) %>% unlist()
  )

Upvotes: 6

Related Questions