Reputation: 1852
I’m trying to calculate the size of an winning and losing streak, and this question is an follow-up of an earlier question I had when I tried to calculate the length of an streak.
This is what my data looks like:
> subRes
Instrument TradeResult.Currency.
1 JPM -3
2 JPM 264
3 JPM 284
4 JPM 69
5 JPM 283
6 JPM -219
7 JPM -91
8 JPM 165
9 JPM -35
10 JPM -294
11 KFT -8
12 KFT -48
13 KFT 125
14 KFT -150
15 KFT -206
16 KFT 107
17 KFT 107
18 KFT 56
19 KFT -26
20 KFT 189
> dput(subRes)
structure(list(Instrument = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("JPM",
"KFT"), class = "factor"), TradeResult.Currency. = c(-3, 264,
284, 69, 283, -219, -91, 165, -35, -294, -8, -48, 125, -150,
-206, 107, 107, 56, -26, 189)), .Names = c("Instrument", "TradeResult.Currency."
), class = "data.frame", row.names = c(NA, 20L))
My goal:
I want to calculate the size of the longest winning and losing streak, for each instrument. So, for JPM
this would be the rows 2, 3, 4, and 5 from the data above, which gives the following TradeResult.Currency.
values: 264 + 284 + 69 +283, for a total of 900. The size of the longest losing streak for JPM would be row 9 and 10, which give an total result of -329 (-35 +-294). For KFT
the size of the longest winning streak is 270 (107 + 107 + 56, rows 16 till 18), and the size of the longest losing streak would be -356 (-150 + -206, rows 14 & 15).
The following function gives the correct size of the winning streak…
WinStreakSize <- function(x){
df.rle <- ifelse(x > 0, 1, 0)
df.rle <- rle(df.rle)
wh <- which(df.rle$lengths == max(df.rle$lengths))
mx <- df.rle$lengths[wh]
suma <- df.rle$lengths[1:wh]
out <- x[(sum(suma) - (suma[length(suma)] - 1)):sum(suma)]
return(sum(out))
}
.. resulting in:
> with(subRes, tapply(TradeResult.Currency., Instrument, WinStreakSize)
+ )
JPM KFT
900 270
However, I can’t seem to adept this function to display the size of the longest losing streak (so that it would output -329 for JPM and -356 for KFT), how stupid that may sound. I tried to change the function in numerous ways, stripped it and rebuild it, and I can't find the cause of it.
Here’s what I mean (output from debugging the function, where the x
values are the values for JPM after splitting subRes
):
Browse[2]> ifelse(x > 0, 1, 0)
[1] 0 1 1 1 1 0 0 1 0 0
Browse[2]> ifelse(x < 0, 1, 0)
[1] 1 0 0 0 0 1 1 0 1 1
Browse[2]> rle( ifelse(x > 0, 1, 0))
Run Length Encoding
lengths: int [1:5] 1 4 2 1 2
values : num [1:5] 0 1 0 1 0
Browse[2]> rle( ifelse(x < 0, 1, 0))
Run Length Encoding
lengths: int [1:5] 1 4 2 1 2
values : num [1:5] 1 0 1 0 1
Browse[2]> inverse.rle( ifelse(x > 0, 1, 0))
Error in x$lengths : $ operator is invalid for atomic vectors
Browse[2]> rle( !ifelse(x < 0, 1, 0))
Run Length Encoding
lengths: int [1:5] 1 4 2 1 2
values : logi [1:5] FALSE TRUE FALSE TRUE FALSE
So, changing the conditions in this function makes no difference in the output of the function. That would suggest I'm looking at the wrong part of the function for an solution, yet the ifelse
statement is the first of the function. In other words, from line 1 and on, the function uses incorrect input despite changing the conditions.
What obvious point am I missing?
Upvotes: 4
Views: 1620
Reputation: 108583
rle(ifelse(x>0,1,0))
is principally the same as rle(ifelse(x<0,1,0))
or rle(x>0)
or rle(x<0)
, with that difference that the values for the runs are different. But you never work with the values of the runs in your function, so that doesn't matter. As you select on the lengths and not on the values, it's obvious you'll get the same result every time again.
Let me simplify the things a bit. With the underlying function, I demonstrate the calculation of both the run lengths and the totals. Take into account that your solution in the question is not exact : there are 2 longest negative runs for JPM. I chose to only return the one with the largest absolute value.
MaxStreakSize <- function(x){
# Get the run lengths and values
df.rle <- rle(x>0)
ngroups <- length(df.rle$lengths)
ll <- df.rle$lengths
val <- df.rle$values
# calculate the sums
id <- rep(1:ngroups,ll)
sums <- tapply(x,id,sum)
# find the largest runs for positive (val) and negative (!val)
rmax <- which(ll==max(ll[val]) & val )
rmin <- which(ll==max(ll[!val]) & !val )
out <- list(
"Lose"=c("length"=max(ll[rmin]),
"sum"=min(sums[rmin])),
"Win"=c("length"=max(ll[rmax]),
"sum"=max(sums[rmax]))
)
return(out)
}
In these kind of problems it's very good to get some kind of index, based on the number of groups and the length of the runs. That makes life already a whole lot easier. This allows me to calculate sums, means etc. with a simple tapply
. After I constructed three vectors of the same length (ll
, sums
and val
), I can link the length, the value and the sum of the runs easily together and select whatever I want to get out.
An advantage of using rle(x>0) is that you can use the values as index, which greatly simplifies things.
Upvotes: 5