Andrew Bannerman
Andrew Bannerman

Reputation: 1305

dplyr group by, carry forward value from previous group to next

Ok this is the over all view of what i'm trying to achieve with dplyr:

Example

Using dplyr I am making calculations to form new columns.

initial.capital - 
x.long.shares - 
x.end.value - 
x.net.profit - 
new.initial.capital

The code that does this:

# Calculate Share Prices For Each ETF 
# Initialize Start Capital Column 
library(dplyr)
library(data.table)
df$inital.capital <- 10000
output <- df %>%
  dplyr::mutate(RunID = data.table::rleid(x.long)) %>%
  group_by(RunID) %>%
  dplyr::mutate(x.long.shares = ifelse(x.long == 0,0, 
                                       ifelse(row_number() == n(),
                                      first(inital.capital) / first(close.x),0))) %>%
  dplyr::mutate(x.end.value = ifelse(x.long == 0,0, 
                                       ifelse(row_number() == n(),
                                              last(x.long.shares) * last(close.x),0))) %>%
  dplyr::mutate(x.net.profit = ifelse(x.long == 0,0, 
                                     ifelse(row_number() == n(),
                                            last(initial.capital) - last(x.end.value),0))) %>%
  dplyr::mutate(new.initial.capital = ifelse(x.long == 0,0, 
                                      ifelse(row_number() == n(),
                                             last(x.net.profit) + last(inital.capital),0))) %>%

  ungroup() %>%
  select(-RunID)

I am grouping per x.long column. And when grouped. Making calculations from different columns using the first/last positions within the group My basic question is:

In the photo, see red highlight under new.initial.capital column. How can I 'save' this value (10185.33)... and insert it on the NEXT group, saving it under initial.capital column, again highlighted in red (it would replace 10,000 Or storing it on the first line of the group)?

Edit

What I really need to do is save the final value in the new.initial.capital column into a variable. Then this variable can be used in the next group (see code below) The value here will be used as part of the next groups calculations... and then when the end new.initial.capital is updated, then this values goes into the variable, then it carrys to the start of the next group (see code below).. then all the values will update again.... The variable would be placed here:

output <- df %>%
  dplyr::mutate(RunID = data.table::rleid(x.long)) %>%
  group_by(RunID) %>%
  dplyr::mutate(x.long.shares = ifelse(x.long == 0,0, 
                                       ifelse(row_number() == n(),
                                      first(end_of_new.initial.capital_variable_from_previous_group) / first(close.x),0))) %>%

I essentially want to carry over values between dplyr groups. Is this possible? Or can I store it in a variable each time?

Heres some example data that is in the photo: Save to .txt

df <- read.table("your_dir\df.txt",header=TRUE, sep="", stringsAsFactors=FALSE)

    close.x x.long  y.short x.short y.long  inital.capital  x.long.shares   x.end.value x.net.profit    new.initial.capital
37.96   NA  NA  NA  NA  10000   NA  NA  NA  NA
36.52   0   0   0   0   10000   0   0   0   0
38.32   0   0   0   0   10000   0   0   0   0
38.5504 0   0   0   0   10000   0   0   0   0
38.17   0   0   0   0   10000   0   0   0   0
38.85   1   1   0   0   10000   0   0   0   0
38.53   1   1   0   0   10000   0   0   0   0
39.13   1   1   0   0   10000   0   0   0   0
38.13   1   1   0   0   10000   257.4002574 9814.671815 185.3281853 10185.32819
37.01   0   0   1   1   10000   0   0   0   0
36.14   0   0   1   1   10000   0   0   0   0
35.27   0   0   1   1   10000   0   0   0   0
35.13   0   0   1   1   10000   0   0   0   0
32.2    0   0   1   1   10000   0   0   0   0
33.03   1   1   0   0   10000   0   0   0   0
34.94   1   1   0   0   10000   0   0   0   0
34.57   1   1   0   0   10000   0   0   0   0
33.6    1   1   0   0   10000   0   0   0   0
34.34   1   1   0   0   10000   302.7550711 10396.60914 -396.6091432    9603.390857
35.86   0   0   1   1   10000   0   0   0   0

What I have Tried

I tried to make a variable:

inital.capital <- 10000

And insert this in the code...

output <- df %>%
  dplyr::mutate(RunID = data.table::rleid(x.long)) %>%
  group_by(RunID) %>%
  dplyr::mutate(x.long.shares = ifelse(x.long == 0,0, 
                                       ifelse(row_number() == n(),
                                              initial.capital / first(close.x),0))) %>%   # place initial.capital variable.. initialized with 10000
  dplyr::mutate(x.end.value = ifelse(x.long == 0,0, 
                                       ifelse(row_number() == n(),
                                              last(x.long.shares) * last(close.x),0))) %>%
  dplyr::mutate(x.net.profit = ifelse(x.long == 0,0, 
                                     ifelse(row_number() == n(),
                                            last(initial.capital) - last(x.end.value),0))) %>%
  dplyr::mutate(new.initial.capital = ifelse(x.long == 0,0, 
                                      ifelse(row_number() == n(),
                                             last(x.net.profit) + last(inital.capital),0))) %>%
  dplyr::mutate(new.initial.capitals = ifelse(x.long == 0,0, 
                                             ifelse(row_number() == n(),
                                                    inital.capital < - last(new.initial.capital),0))) %>%  # update variable with the final balance of new.inital.capital column

  ungroup() %>%
  select(-RunID)

If I can update the initial.capital variable each time. This then would serve as the 'link' between groups. However, this idea is not currently working in the dplyr setup.

Any assistance appreciated.

Upvotes: 10

Views: 4074

Answers (6)

John Haberstroh
John Haberstroh

Reputation: 485

Rolling forwards a value like that can be very difficult. I think it would be preferred to put in a line at the top that acts as a transaction whose net effect is to add 10k to your base capital. You can then use a cumulative sum on the offsets to achieve what you are looking for with relative ease:

pdf = df %>% group_by(group) %>% arrange(dates) %>% mutate(cs = cumsum(sales))

Code copied from r cumsum per group in dplyr

Upvotes: 0

Andrew Bannerman
Andrew Bannerman

Reputation: 1305

I decided to revisit this problem here is a solution by grouping per trade signal, making a start and end of trade group ID. After, use a normal for loop to do the calculations on ifelse statements and updating running variables between groups: shares, total_start_capital and total_end_capital. These allow carrying variables over from trade to the next trade and be used in each successive trade calculations. If only dplyr allowed updating of variables between groups. This has value if someone wants to create their own back test script with the use of PnL $ versus % rets.

# Dollar PnL Back Test Script Example 
# Andrew Bannerman 1.7.2017

df<- read.table(text="37.96   NA  NA  
36.52   0   0   
38.32   0   0   
38.55  0   0  
38.17   0   0   
38.85   1   1   
38.53   1   1  
39.13   1   1   
38.13   1   1    
37.01   0   0   
36.14   0   0   
35.27   0   0   
35.13   0   0   
32.2    0   0 
33.03   1   1   
34.94   1   1   
34.57   1   1  
33.6    1   1 
34.34   1   1  
35.86 0     0   ",stringsAsFactors=FALSE,header=TRUE)

colnames(df)[1] <- "close"
colnames(df)[2] <- "signal"
colnames(df)[3] <- "signal_short"

# Place group id at start/end of each group 
df <- df %>%
  dplyr::mutate(ID = data.table::rleid(signal)) %>%
  group_by(ID) %>%
  dplyr::mutate(TradeID = ifelse(signal ==1,as.numeric(row_number()),0))%>% # Run id per group month
  dplyr::mutate(group_id_last = ifelse(signal == 0,0, 
                                         ifelse(row_number() == n(), 3,0))) %>%
  dplyr::mutate(group_id_first = ifelse(TradeID == 1 & signal == 1,2,0))

############################################## 
# Custom loop 
################################################
run_start_equity <- 10000  # Enter starting equity
run_end_equity <- 0        # variable for updating end equity in loop
run.shares <- 0
df$start.balance <- 0
df$net.proceeds <- 0
df$end.balance <-0
df$shares <- 0
i=1
for (i in 1:nrow(df)) { 
  df$start.balance[i] <- ifelse(df$group_id_first[i] == 2, run_start_equity, 0)
  df$shares[i] <- ifelse(df$group_id_first[i] == 2, run_start_equity / df$close[i],0)
  run.shares <- ifelse(df$group_id_first[i] == 2, df$shares[i], run.shares)
  df$end.balance[i] <- ifelse(df$group_id_last[i] == 3, run.shares * df$close[i],0)
  run_end_equity <- ifelse(df$group_id_last[i] == 3, df$end.balance[i],run_end_equity)
  df$net.proceeds[i] <- ifelse(df$group_id_last[i] == 3, run_end_equity - run_start_equity,0)
  run_start_equity <- ifelse(df$group_id_last[i] == 3, df$end.balance[i] ,run_start_equity)
   }

With the desired output:

> df
# A tibble: 19 x 11
# Groups:   ID [5]
   close signal signal_short    ID TradeID group_id_last group_id_first start.balance net.proceeds end.balance   shares
   <dbl>  <int>        <int> <int>   <dbl>         <dbl>          <dbl>         <dbl>        <dbl>       <dbl>    <dbl>
 1 36.52      0            0     1       0             0              0         0.000       0.0000       0.000   0.0000
 2 38.32      0            0     1       0             0              0         0.000       0.0000       0.000   0.0000
 3 38.55      0            0     1       0             0              0         0.000       0.0000       0.000   0.0000
 4 38.17      0            0     1       0             0              0         0.000       0.0000       0.000   0.0000
 5 38.85      1            1     2       1             0              2     10000.000       0.0000       0.000 257.4003
 6 38.53      1            1     2       2             0              0         0.000       0.0000       0.000   0.0000
 7 39.13      1            1     2       3             0              0         0.000       0.0000       0.000   0.0000
 8 38.13      1            1     2       4             3              0         0.000    -185.3282    9814.672   0.0000
 9 37.01      0            0     3       0             0              0         0.000       0.0000       0.000   0.0000
10 36.14      0            0     3       0             0              0         0.000       0.0000       0.000   0.0000
11 35.27      0            0     3       0             0              0         0.000       0.0000       0.000   0.0000
12 35.13      0            0     3       0             0              0         0.000       0.0000       0.000   0.0000
13 32.20      0            0     3       0             0              0         0.000       0.0000       0.000   0.0000
14 33.03      1            1     4       1             0              2      9814.672       0.0000       0.000 297.1442
15 34.94      1            1     4       2             0              0         0.000       0.0000       0.000   0.0000
16 34.57      1            1     4       3             0              0         0.000       0.0000       0.000   0.0000
17 33.60      1            1     4       4             0              0         0.000       0.0000       0.000   0.0000
18 34.34      1            1     4       5             3              0         0.000     389.2589   10203.931   0.0000
19 35.86      0            0     5       0             0              0         0.000       0.0000       0.000   0.0000

Upvotes: 0

moodymudskipper
moodymudskipper

Reputation: 47300

This kind of use of first and last is very untidy, so we'll keep it for the latest step.

First we build intermediate data, following your code, but adding some columns to join later at the right places. I'm not sure if you need to keep all columns, you won't need the second join if not.

library(dplyr)
library(tidyr)

df1 <- df0 %>%
  dplyr::mutate(RunID = data.table::rleid(x.long)) %>%
  group_by(RunID) %>%
  mutate(RunID_f = ifelse(row_number()==1,RunID,NA)) %>%  #  for later merge
  mutate(RunID_l = ifelse(row_number()==n(),RunID,NA))    #  possibly unneeded

Then we build summarized data, I refactored your code a bit as you see, because these operations "should" be rowwise.

summarized_data <- df1 %>%
  filter(x.long !=0) %>%
  summarize_at(vars(close.x,inital.capital),c("first","last")) %>%
  mutate(x.long.share        = inital.capital_first / close.x_first,
         x.end.value         = x.long.share         * close.x_last,
         x.net.profit        = inital.capital_last - x.end.value,
         new.initial.capital = x.net.profit         + inital.capital_last,
         lagged.new.initial.capital = lag(new.initial.capital,1))

# A tibble: 2 x 10
#   RunID close.x_first inital.capital_first close.x_last inital.capital_last x.long.share x.end.value x.net.profit new.initial.capital lagged.new.initial.capital
#   <int>         <dbl>                <int>        <dbl>               <int>        <dbl>       <dbl>        <dbl>               <dbl>                      <dbl>
# 1     3         38.85                10000        38.13               10000     257.4003    9814.672     185.3282           10185.328                         NA
# 2     5         33.03                10000        34.34               10000     302.7551   10396.609    -396.6091            9603.391                   10185.33

Then we join our summarized table to the original, getting advantage of the trick of the firt step. The first join may be skipped if you don't need all columns.

df2 <- df1 %>% ungroup %>%
  left_join(summarized_data %>% select(-lagged.new.initial.capital) ,by=c("RunID_l"="RunID")) %>%      # if you want the other variables, if not, skip the line
  left_join(summarized_data %>% select(RunID,lagged.new.initial.capital) ,by=c("RunID_f"="RunID")) %>%
  mutate(inital.capital = ifelse(is.na(lagged.new.initial.capital),inital.capital,lagged.new.initial.capital)) %>%
  select(close.x:inital.capital) # for readability here

# # A tibble: 20 x 6
# close.x x.long y.short x.short y.long inital.capital
# <dbl>  <int>   <int>   <int>  <int>          <dbl>
#  1 37.9600     NA      NA      NA     NA       10000.00
#  2 36.5200      0       0       0      0       10000.00
#  3 38.3200      0       0       0      0       10000.00
#  4 38.5504      0       0       0      0       10000.00
#  5 38.1700      0       0       0      0       10000.00
#  6 38.8500      1       1       0      0       10000.00
#  7 38.5300      1       1       0      0       10000.00
#  8 39.1300      1       1       0      0       10000.00
#  9 38.1300      1       1       0      0       10000.00
# 10 37.0100      0       0       1      1       10000.00
# 11 36.1400      0       0       1      1       10000.00
# 12 35.2700      0       0       1      1       10000.00
# 13 35.1300      0       0       1      1       10000.00
# 14 32.2000      0       0       1      1       10000.00
# 15 33.0300      1       1       0      0       10185.33
# 16 34.9400      1       1       0      0       10000.00
# 17 34.5700      1       1       0      0       10000.00
# 18 33.6000      1       1       0      0       10000.00
# 19 34.3400      1       1       0      0       10000.00
# 20 35.8600      0       0       1      1       10000.00

data

df<- read.table(text="close.x x.long  y.short x.short y.long  inital.capital  x.long.shares   x.end.value x.net.profit    new.initial.capital
37.96   NA  NA  NA  NA  10000   NA  NA  NA  NA
36.52   0   0   0   0   10000   0   0   0   0
38.32   0   0   0   0   10000   0   0   0   0
38.5504 0   0   0   0   10000   0   0   0   0
38.17   0   0   0   0   10000   0   0   0   0
38.85   1   1   0   0   10000   0   0   0   0
38.53   1   1   0   0   10000   0   0   0   0
39.13   1   1   0   0   10000   0   0   0   0
38.13   1   1   0   0   10000   257.4002574 9814.671815 185.3281853 10185.32819
37.01   0   0   1   1   10000   0   0   0   0
36.14   0   0   1   1   10000   0   0   0   0
35.27   0   0   1   1   10000   0   0   0   0
35.13   0   0   1   1   10000   0   0   0   0
32.2    0   0   1   1   10000   0   0   0   0
33.03   1   1   0   0   10000   0   0   0   0
34.94   1   1   0   0   10000   0   0   0   0
34.57   1   1   0   0   10000   0   0   0   0
33.6    1   1   0   0   10000   0   0   0   0
34.34   1   1   0   0   10000   302.7550711 10396.60914 -396.6091432    9603.390857
35.86   0   0   1   1   10000   0   0   0   0",stringsAsFactors=FALSE,header=TRUE)

df0 <- df %>% select(close.x:inital.capital)

Upvotes: 1

Matt Dowle
Matt Dowle

Reputation: 59602

You're using data.table in the question and have tagged the question data.table, so here is a data.table answer. When j evaluates, it's in a static scope where local variables retain their values from the previous group.

Using dummy data to demonstrate :

require(data.table)
set.seed(1)
DT = data.table( long = rep(c(0,1,0,1),each=3),
                 val = sample(5,12,replace=TRUE))
DT
    long val
 1:    0   2
 2:    0   2
 3:    0   3
 4:    1   5
 5:    1   2
 6:    1   5
 7:    0   5
 8:    0   4
 9:    0   4
10:    1   1
11:    1   2
12:    1   1

DT[, v1:=sum(val), by=rleid(long)][]
    long val v1
 1:    0   2  7
 2:    0   2  7
 3:    0   3  7
 4:    1   5 12
 5:    1   2 12
 6:    1   5 12
 7:    0   5 13
 8:    0   4 13
 9:    0   4 13
10:    1   1  4
11:    1   2  4
12:    1   1  4

So far, simple enough.

prev = NA  # initialize previous group value
DT[, v2:={ans<-last(val)/prev; prev<-sum(val); ans}, by=rleid(long)][]
    long val v1         v2
 1:    0   2  7         NA
 2:    0   2  7         NA
 3:    0   3  7         NA
 4:    1   5 12 0.71428571
 5:    1   2 12 0.71428571
 6:    1   5 12 0.71428571
 7:    0   5 13 0.33333333
 8:    0   4 13 0.33333333
 9:    0   4 13 0.33333333
10:    1   1  4 0.07692308
11:    1   2  4 0.07692308
12:    1   1  4 0.07692308

> 3/NA
[1] NA
> 5/7
[1] 0.7142857
> 4/12
[1] 0.3333333
> 1/13
[1] 0.07692308
> prev
[1] NA

Notice that the prev value did not update because prev and ans are local variables inside j's scope that were being updated as each group ran. Just to illustrate, the global prev can be updated from within each group using R's <<- operator :

DT[, v2:={ans<-last(val)/prev; prev<<-sum(val); ans}, by=rleid(long)]
prev
[1] 4

But there's no need to use <<- in data.table as local variables are static (retain their values from previous group). Unless you need to use the final group's value after the query has finished.

Upvotes: 12

Empiromancer
Empiromancer

Reputation: 3854

You're going to have a hard time finding an 'elegant' pure-dplyr solution, because dplyr isn't really designed to do this. What dplyr likes to do is map/reduce type operations (mutate and summarize) that use window and summary functions respectively. What you're asking for isn't really either of those, because you want each group to depend on the last, so you're really describing a looping operation with side effects - two very non-R-philosophy operations.

If you want to hack your way into doing what you describe, you can try an approach like this:

new.initial.capital <- 0
for (z in split(df, df$x.long)) {
    z$initial.capital[[1]] <- new.initial.capital
    # some other calculations here
    # maybe you want to modify df as well
    new.initial.capital <- foo
}

However, this is really not a very R-friendly piece of code, as it depends on side effects and loops. I would advise seeing if you can reframe your calculations in terms of a summary and/or window function if you want to integrate with dplyr.

For more:
https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf
https://danieljhocking.wordpress.com/2014/12/03/lags-and-moving-means-in-dplyr/

Upvotes: 2

Kim
Kim

Reputation: 4298

It took me a very long time to understand what you are going for: for a single "update", does this work?

library(tidyverse)
library(magrittr)
temp <- df %>% 
  dplyr::mutate(RunID = data.table::rleid(x.long)) %>%
  group_by(RunID) %>% # Don't delete the RunID 
  dplyr::mutate(max.new = max(new.initial.capital)) %>% 
  slice(1) %>%
  arrange(x.long) %>% 
  dplyr::mutate(pass.value = lag(max.new))

df <- left_join(df, temp %>% dplyr::select(x.long, RunID, pass.value)

After this, replace values of initial.capital using pass.value column, according to grouped row_number as you have done above.

I'm not quite sure how to go about it without looping this updating procedure, and I guess if you want to do 10,000 updates like this it will certainly be a bummer. But it will be enable you to "pass" the value to the second red cell as in your picture.

Upvotes: 0

Related Questions