Reputation: 434
Alternate title might be 'Use lag within mutate to refer to previous rows mutation'
I'd like to include values generated for previous rows as inputs to a mutate calculation. Some data:
mydiamonds <- diamonds %>%
mutate(Ideal = ifelse(cut == 'Ideal', 1, 0)) %>%
group_by(Ideal) %>%
mutate(rn = row_number()) %>%
arrange(Ideal, rn) %>%
mutate(CumPrice = cumsum(price)) %>%
mutate(InitialPrice = min(price)) %>%
select(Ideal, rn, CumPrice, InitialPrice)
Looks like this:
mydiamonds %>% head
# A tibble: 6 x 4
# Groups: Ideal [1]
Ideal rn CumPrice InitialPrice
<dbl> <int> <int> <int>
1 0 1 326 326
2 0 2 653 326
3 0 3 987 326
4 0 4 1322 326
5 0 5 1658 326
6 0 6 1994 326
A model:
mod.diamonds = glm(CumPrice ~ log(lag(CumPrice)) +log(rn) + Ideal , family = "poisson", data = mydiamonds)
Test the model:
# new data, pretend we don't know CumPrice but want to use predictions to predict subsequent predictions
mydiamonds.testdata <- mydiamonds %>% select(-CumPrice)
# manual prediction based on lag(prediction), for the first row in each group use InitialPrice
## add coefficients as fields
coeffs <- mod.diamonds$coefficients
mydiamonds.testdata <- mydiamonds.testdata %>%
mutate(CoefIntercept = coeffs['(Intercept)'],
CoefLogLagCumPrice = coeffs['log(lag(CumPrice))'],
CoefLogRn = coeffs['log(rn)'],
CoefIdeal = coeffs['Ideal']
)
Here's how my test data look:
mydiamonds.testdata %>% head
# A tibble: 6 x 7
# Groups: Ideal [1]
Ideal rn InitialPrice CoefIntercept CoefLogLagCumPrice CoefLogRn CoefIdeal
<dbl> <int> <int> <dbl> <dbl> <dbl> <dbl>
1 0 1 326 0.0931 0.987 0.0154 -0.000715
2 0 2 326 0.0931 0.987 0.0154 -0.000715
3 0 3 326 0.0931 0.987 0.0154 -0.000715
4 0 4 326 0.0931 0.987 0.0154 -0.000715
5 0 5 326 0.0931 0.987 0.0154 -0.000715
6 0 6 326 0.0931 0.987 0.0154 -0.000715
Cannot use predict(), since I need to recursively predict where predictions for the previous day/row are input to the current day. Instead try manual prediction using the coefficents:
# prediction
mydiamonds.testdata <- mydiamonds.testdata %>%
mutate(
Prediction = CoefIntercept +
# here's the hard bit. If it's the first row in the group, use InitialPrice, else use the value of the previous prediction
(CoefLogLagCumPrice * ifelse(rn == 1, InitialPrice, lag(Prediction))) +
(CoefLogRn * log(rn)) +
(CoefIdeal * Ideal)
)
Error: Problem with
mutate()
inputPrediction
. x object 'Prediction' not found ℹ InputPrediction
is+...
. ℹ The error occurred in group 1: Ideal = 0.
How can I mutate in this way, where I'd like to refer to the previous rows mutation? (Unless it's the very first row, in which case use InitialPrice)
[EDIT] following commenter, I gave it a shot with accumulate, a function I'm not so familiar with:
mydiamonds.testdata <- mydiamonds.testdata %>%
mutate(
Prediction = accumulate(.f = function(.) {
.$CoefIntercept +
# here's the hard bit. If it's the first row in the group, use InitialPrice, else use the value of the previous prediction
(.$CoefLogLagCumPrice * ifelse(.$rn == 1, .$InitialPrice, lag(.$Prediction))) +
(.$CoefLogRn * log(.$rn)) +
(.$CoefIdeal * .$Ideal)
}))
Error: Problem with `mutate()` input `Prediction`.
x argument ".x" is missing, with no default
ℹ Input `Prediction` is `accumulate(...)`.
ℹ The error occurred in group 1: Ideal = 0.
Upvotes: 5
Views: 885
Reputation: 8494
As you said you are not used to this rather complex function here is a bit of an explanation.
purrr::accumulate()
is used to calculate row-wise recursive operations. Its first argument .x
is the variable you want to accumulate on.
Its second argument .f
is a function that should have 2 arguments: the current result cur
, and the next evaluated value val
.
The first time the .f
is called, cur
is equal to .x[1]
(by default), then it is equal to the previous result returned by .f
.
purrr::accumulate2()
allows us to use a second variable .y
to iterate on. The first value of .y
is always ignored as .f
already know what to return at this time. Therefore, .y
should be one item shorter than .x
.
Unfortunately, there is only accumulate()
and accumulate2()
where you would need accumulate3()
or paccumulate()
to accumulate on rn, Ideal and Price.
However, by using row_number()
and cur_data()
, you can trick accumulate2()
to behave as you want:
CoefIntercept = coeffs['(Intercept)']
CoefLogLagCumPrice = coeffs['log(lag(CumPrice))']
CoefLogRn = coeffs['log(rn)']
CoefIdeal = coeffs['Ideal']
mydiamonds.testdata <- mydiamonds %>%
ungroup() %>%
select(-CumPrice) %>%
mutate(
Prediction = accumulate2(.x=InitialPrice, .y=row_number()[-1],
.f=function(acc, nxt, row) {
db=cur_data_all()
rn = db$rn[row]
Ideal = db$Ideal[row]
CoefIntercept +
(CoefLogLagCumPrice * acc) +
(CoefLogRn * log(rn)) +
(CoefIdeal * Ideal)
}) %>% unlist()
)
mydiamonds.testdata
# A tibble: 53,940 x 4
# Ideal rn InitialPrice Prediction
# <dbl> <int> <int> <dbl>
# 1 0 1 326 326
# 2 0 2 326 322.
# 3 0 3 326 318.
# 4 0 4 326 313.
# 5 0 5 326 309.
# 6 0 6 326 305.
# 7 0 7 326 301.
# 8 0 8 326 297.
# 9 0 9 326 294.
# 10 0 10 326 290.
EDIT: There is another, cleaner way using the .init
argument, as the InitialPrice
column is never really used except for its first value. This allows to directly use the arguments, but it won't work for more complex models with more covariates.
mydiamonds.testdata <- mydiamonds %>%
ungroup() %>%
select(-CumPrice) %>%
mutate(
Prediction = accumulate2(.x=Ideal[-1], .y=rn[-1],
.init=InitialPrice[1],
.f=function(rslt, Ideal, rn) {
CoefIntercept +
(CoefLogLagCumPrice * rslt) +
(CoefLogRn * log(rn)) +
(CoefIdeal * Ideal)
}) %>% unlist()
)
Upvotes: 6