Jos
Jos

Reputation: 1852

Calculating the size of an winning and losing streak

I’m trying to calculate the size of an winning and losing streak, and this question is an follow-up of an earlier question I had when I tried to calculate the length of an streak.

This is what my data looks like:

> subRes
   Instrument TradeResult.Currency.
1         JPM                    -3
2         JPM                   264
3         JPM                   284
4         JPM                    69
5         JPM                   283
6         JPM                  -219
7         JPM                   -91
8         JPM                   165
9         JPM                   -35
10        JPM                  -294
11        KFT                    -8
12        KFT                   -48
13        KFT                   125
14        KFT                  -150
15        KFT                  -206
16        KFT                   107
17        KFT                   107
18        KFT                    56
19        KFT                   -26
20        KFT                   189
> dput(subRes)
structure(list(Instrument = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("JPM", 
"KFT"), class = "factor"), TradeResult.Currency. = c(-3, 264, 
284, 69, 283, -219, -91, 165, -35, -294, -8, -48, 125, -150, 
-206, 107, 107, 56, -26, 189)), .Names = c("Instrument", "TradeResult.Currency."
), class = "data.frame", row.names = c(NA, 20L))

My goal: I want to calculate the size of the longest winning and losing streak, for each instrument. So, for JPM this would be the rows 2, 3, 4, and 5 from the data above, which gives the following TradeResult.Currency. values: 264 + 284 + 69 +283, for a total of 900. The size of the longest losing streak for JPM would be row 9 and 10, which give an total result of -329 (-35 +-294). For KFT the size of the longest winning streak is 270 (107 + 107 + 56, rows 16 till 18), and the size of the longest losing streak would be -356 (-150 + -206, rows 14 & 15).

The following function gives the correct size of the winning streak…

WinStreakSize <- function(x){
    df.rle <- ifelse(x > 0, 1, 0)
    df.rle <- rle(df.rle)
    wh <- which(df.rle$lengths == max(df.rle$lengths))
    mx <- df.rle$lengths[wh]
    suma <- df.rle$lengths[1:wh]
    out <- x[(sum(suma) - (suma[length(suma)] - 1)):sum(suma)]
    return(sum(out))
}

.. resulting in:

> with(subRes, tapply(TradeResult.Currency., Instrument, WinStreakSize)
+ )
JPM KFT 
900 270

However, I can’t seem to adept this function to display the size of the longest losing streak (so that it would output -329 for JPM and -356 for KFT), how stupid that may sound. I tried to change the function in numerous ways, stripped it and rebuild it, and I can't find the cause of it.

Here’s what I mean (output from debugging the function, where the x values are the values for JPM after splitting subRes):

Browse[2]>  ifelse(x > 0, 1, 0)
 [1] 0 1 1 1 1 0 0 1 0 0
Browse[2]>  ifelse(x < 0, 1, 0)
 [1] 1 0 0 0 0 1 1 0 1 1
Browse[2]> rle( ifelse(x > 0, 1, 0))
Run Length Encoding
  lengths: int [1:5] 1 4 2 1 2
  values : num [1:5] 0 1 0 1 0
Browse[2]> rle( ifelse(x < 0, 1, 0))
Run Length Encoding
  lengths: int [1:5] 1 4 2 1 2
  values : num [1:5] 1 0 1 0 1
Browse[2]> inverse.rle( ifelse(x > 0, 1, 0))
Error in x$lengths : $ operator is invalid for atomic vectors
Browse[2]> rle( !ifelse(x < 0, 1, 0))
Run Length Encoding
  lengths: int [1:5] 1 4 2 1 2
  values : logi [1:5] FALSE TRUE FALSE TRUE FALSE

So, changing the conditions in this function makes no difference in the output of the function. That would suggest I'm looking at the wrong part of the function for an solution, yet the ifelse statement is the first of the function. In other words, from line 1 and on, the function uses incorrect input despite changing the conditions.

What obvious point am I missing?

Upvotes: 4

Views: 1620

Answers (1)

Joris Meys
Joris Meys

Reputation: 108583

rle(ifelse(x>0,1,0)) is principally the same as rle(ifelse(x<0,1,0)) or rle(x>0) or rle(x<0), with that difference that the values for the runs are different. But you never work with the values of the runs in your function, so that doesn't matter. As you select on the lengths and not on the values, it's obvious you'll get the same result every time again.

Let me simplify the things a bit. With the underlying function, I demonstrate the calculation of both the run lengths and the totals. Take into account that your solution in the question is not exact : there are 2 longest negative runs for JPM. I chose to only return the one with the largest absolute value.

MaxStreakSize <- function(x){
    # Get the run lengths and values
    df.rle <- rle(x>0)
    ngroups <- length(df.rle$lengths)
    ll <- df.rle$lengths
    val <- df.rle$values

    # calculate the sums
    id <- rep(1:ngroups,ll)
    sums <- tapply(x,id,sum)

    # find the largest runs for positive (val) and negative (!val)
    rmax <- which(ll==max(ll[val]) & val )
    rmin <- which(ll==max(ll[!val]) & !val )

    out <- list(
            "Lose"=c("length"=max(ll[rmin]),
                      "sum"=min(sums[rmin])),
            "Win"=c("length"=max(ll[rmax]),
                    "sum"=max(sums[rmax]))
            )
    return(out)
}

In these kind of problems it's very good to get some kind of index, based on the number of groups and the length of the runs. That makes life already a whole lot easier. This allows me to calculate sums, means etc. with a simple tapply. After I constructed three vectors of the same length (ll, sums and val), I can link the length, the value and the sum of the runs easily together and select whatever I want to get out.

An advantage of using rle(x>0) is that you can use the values as index, which greatly simplifies things.

Upvotes: 5

Related Questions