Andrew Bannerman
Andrew Bannerman

Reputation: 1305

R - Find maximum run of positive / negative values

I have a data frame of about 3000 rows. I wish to find the longest run of positive and negative numbers.

My example dataframe: df

1   0.502310591 
2   -0.247577976    
3   -0.307256769    2
4   0.442253678 
5   -0.795770351    
6   2.08244648  
7   -0.01672777 
8   -0.164145656    2
9   0.610117365 
10  0.014758371 
11  0.381105476 
12  0.721386493 4
13  -0.363222383    
14  0.201409322 
15  0.724867214 
16  -1.586829584    
17  1.066288451 
18  0.182824494 
19  0.237447191 
20  -0.215475797

Longest positive run: 4 Longest negative run: 2

I am following this tutorial: https://ocw.mit.edu/ans7870/18/18.05/s14/html/r-tut-rle.html

I need to fund the longest run of values >0 and also <0. So is there any way to edit the above?

I guess this only finds longest lengths of 1,0? If that case then i would need a helper column ifelse 1,0 to split the negative , positive... then perhaps find longest length with this code:

> df$wins <- ifelse(df$V2 > 0, 1, 0) 
> df$loss <- ifelse(df$V2 < 0, 1, 0)
> win <- (c(df$wins))
> max(rle(win)$lengths)
[1] 4

This works for finding maximum wins...

This is for losses:

> print(df$loss)
 [1] 0 1 1 0 1 0 1 1 0 0 0 0 1 0 0 1 0 0 0 1
> df$loss <- ifelse(df$V2 < 0, 1, 0)
> print(df$loss)
 [1] 0 1 1 0 1 0 1 1 0 0 0 0 1 0 0 1 0 0 0 1
> loss <- (c(df$loss))
> max(rle(loss)$lengths)
[1] 4

Not sure why it says 4... there is clearly 2 maximum losses, anyone see why? What am i missing, shouldn't the logic of the wins work exactly the same for the loss? I cant see any error in the code...

if value in df$loss is less than 0 print 1 else 0. make a vector containing the contents of df$loss column find max length of 1's using max(rle(loss)$lengths) again result is 4.. however, its clearly 2?

Upvotes: 0

Views: 3223

Answers (3)

hvollmeier
hvollmeier

Reputation: 2986

To get the maximum positive/negative runs in a series ( in your case in a column of a data.frame the rle function is all you need:

set.seed(123)
df <- data.frame(col1=rnorm(20, mean = 0, sd = 1))

max. pos run:

max(rle(sign(df$col1))[[1]][rle(sign(df$col1))[[2]] == 1])
[1] 5

max. neg.run:

max(rle(sign(df$col1))[[1]][rle(sign(df$col1))[[2]] == -1])
[1] 3

let’s check the results:

> df
          col1
1  -0.56047565
2  -0.23017749
3   1.55870831
4   0.07050839
5   0.12928774
6   1.71506499
7   0.46091621
8  -1.26506123
9  -0.68685285
10 -0.44566197
11  1.22408180
12  0.35981383
13  0.40077145
14  0.11068272
15 -0.55584113
16  1.78691314
17  0.49785048
18 -1.96661716
19  0.70135590
20 -0.47279141

Upvotes: 1

AntoniosK
AntoniosK

Reputation: 16121

I think rle will do the job based on your objective, but I'll post an alternative solution, with a little bit more data manipulation, but you'll be able to get some more info.

Typically, after answering one question you'll be asked to answer more. For example, get an idea of the distribution of the runs for positives vs. negatives, as the maximum value might not tell you much. Or, perform a statistical comparison to see if positives run longer than the negatives on average.

Also, with ~3000 rows I don't think you'll have any speed issues.

library(dplyr)

# example dataset
dt = data.frame(x = c(1,-1,-2,0,2,4,3,5,-5,-6,-7,0,0))

# get a dataset that assigns an id to all positive or negative series
dt %>%
  mutate(sign = ifelse(x > 0, "pos", ifelse(x < 0, "neg", "zero")), # get the sign of the value
         sign_lag = lag(sign, default = sign[1]),       # get previous value (exception in the first place)
         change = ifelse(sign != sign_lag, 1 , 0),      # check if there's a change
         series_id = cumsum(change)+1) %>%              # create the series id
  print() -> dt2                                        # print to screen and save it

#     x sign sign_lag change series_id
# 1   1  pos      pos      0         1
# 2  -1  neg      pos      1         2
# 3  -2  neg      neg      0         2
# 4   0 zero      neg      1         3
# 5   2  pos     zero      1         4
# 6   4  pos      pos      0         4
# 7   3  pos      pos      0         4
# 8   5  pos      pos      0         4
# 9  -5  neg      pos      1         5
# 10 -6  neg      neg      0         5
# 11 -7  neg      neg      0         5
# 12  0 zero      neg      1         6
# 13  0 zero     zero      0         6

You can remove the help columns sign_lag and change as I've included them only to demonstrate how the process works.

# Get longest runs
dt2 %>% 
  count(sign, series_id) %>%
  group_by(sign) %>%
  filter(n == max(n)) %>%
  select(-series_id) %>%
  ungroup

# # A tibble: 3 x 2
#    sign     n
#   <chr> <int>
# 1   neg     3
# 2   pos     4
# 3  zero     2


# Get all runs
dt2 %>% count(sign, series_id)

# # A tibble: 6 x 3
#    sign series_id     n
#   <chr>     <dbl> <int>
# 1   neg         2     2
# 2   neg         5     3
# 3   pos         1     1
# 4   pos         4     4
# 5  zero         3     1
# 6  zero         6     2

If you get all runs you will be able to plot distributions of the runs of positives vs. negatives, or perform statistical comparisons to see if the positives run longer on average than the negative.

Upvotes: 1

programmer777
programmer777

Reputation: 127

This is a simple manner to do it, I assumed that you began with a dataframe with only the values I also assumed you need a dataframe with the counts of the same digits run a column for the positive and a column for the negative.

set.seed(42)
df=data.frame(x= runif(300, -1.0, 1.0))
count_pos=c()
count_neg=c()
n1=df$x[1]
if (sign(n1)==1){
  count_pos[1]=1
  count_neg[1]=0
}else{
  count_neg[1]=1
  count_pos[1]=0
}
count=1
index=1
for (i in df$x[2:nrow(df)]){
  #print (i)
  index=index+1
  if (sign(n1)==sign(i)){
    count=count+1
  }
  else{
    count=1
  }
  if (sign(i)==1){
    count_pos[index]=count
    count_neg[index]=0
  }else{
    count_pos[index]=0
    count_neg[index]=count
  }
  n1=i
}

df2=data.frame(x=df$x,count_pos=count_pos,count_neg=count_neg)
#df2 contains the dataframe with columns that count the run for numbers with 
#same sign
print (paste("Maximum run of Positive numbers:",max(count_pos)))
print (paste("Maximun run of negative numbers:",max(count_neg)))

I know there is a more compact way to do it but this code makes the trick.

Upvotes: 1

Related Questions