guyabel
guyabel

Reputation: 8366

Projections within a data frame

I am trying to come up with a way to do a projection within a data frame, preferably dplyr.

library("dplyr")

set.seed(1)
df0 <- data_frame(t = 0:5, 
                  r = c(NA, rnorm(n = 5, mean = 1, sd = 0.1)), 
                  P = c(100, rep(x = NA, times = 5)))
df0
# Source: local data frame [6 x 3]
# 
#       t         r     P
#   (int)     (dbl) (dbl)
# 1     0        NA   100
# 2     1 0.9373546    NA
# 3     2 1.0183643    NA
# 4     3 0.9164371    NA
# 5     4 1.1595281    NA
# 6     5 1.0329508    NA

I am little stuck as to how to run the projection model recursively...

df0 %>%
  mutate(P = ifelse(test = is.na(P), yes = lag(P)*r, no = P))
# Source: local data frame [6 x 3]
# 
#       t         r         P
#   (int)     (dbl)     (dbl)
# 1     0        NA 100.00000
# 2     1 0.9373546  93.73546
# 3     2 1.0183643        NA
# 4     3 0.9164371        NA
# 5     4 1.1595281        NA
# 6     5 1.0329508        NA

Does anyone know if this is possible?

I have in mind to do this across multiple regions using group_by. The data frame will be quite large, hence the preference for a speedy solution on something other than a data.frame type object.

The only solution I can think of thus far uses a for loop...

for(i in 1:5)
  df0 <- df0 %>% mutate(P = ifelse(is.na(P), yes = lag(P)*r, no = P))
df0
# Source: local data frame [6 x 3]
# 
#       t         r         P
#   (int)     (dbl)     (dbl)
# 1     0        NA 100.00000
# 2     1 0.9373546  93.73546
# 3     2 1.0183643  95.45685
# 4     3 0.9164371  87.48020
# 5     4 1.1595281 101.43575
# 6     5 1.0329508 104.77814

... which can lead to memory problems with my big data set and given all I have read about for loops in R, is probably not the best solution available.

EDIT

Some nice answers using purrr to a very similar questions but for simulations. Written up in a blog post.

Upvotes: 2

Views: 592

Answers (1)

user5249203
user5249203

Reputation: 4648

May be a data.table solution since you mentioned speedy solution ?

DT <- data.table(df0)
for(i in 1:nrow(DT)) 
    set(DT, j = 3L, value = ifelse(is.na(DT$P), yes = lag(DT$P)*DT$r, no = DT$P))   
DT
 t         r         P
1: 0        NA 100.00000
2: 1 0.9373546  93.73546
3: 2 1.0183643  95.45685
4: 3 0.9164371  87.48020
5: 4 1.1595281 101.43575
6: 5 1.0329508 104.77814

speed comparison...

f_dt <- 
  function(){
    for (i in 1:nrow(DT))
     set(DT, j = 3L, value = ifelse(is.na(DT[,P]), yes = lag(DT$P)*DT$r, no = DT$P)) 
    DT
  }

f_dplyr <- 
  function(){
    for (i in 1:nrow(df0))
      df0 <- mutate(df0, P = ifelse(is.na(P), yes = lag(P)*r, no = P))
    df0
  }
f_cumprod <- 
  function(){
    res <- c(df0$P[1],df0$P[1]*cumprod(df0$r[-1]))
    res
  }

library(microbenchmark)

microbenchmark(f_dt(),f_dplyr(),f_cumprod(),times = 100)

Unit: microseconds # only 500 rows
        expr        min         lq         mean      median         uq        max neval
      f_dt() 178350.056 186226.605 192842.91784 190115.9120 195791.748 272405.911   100
   f_dplyr() 307450.092 323326.566 331586.39073 328444.5255 335888.287 387716.640   100
 f_cumprod()     27.798     34.213     45.11819     43.4075     52.175     75.268   100

Upvotes: 1

Related Questions