Reputation: 1305
I have a data frame of about 3000 rows. I wish to find the longest run of positive and negative numbers.
My example dataframe: df
1 0.502310591
2 -0.247577976
3 -0.307256769 2
4 0.442253678
5 -0.795770351
6 2.08244648
7 -0.01672777
8 -0.164145656 2
9 0.610117365
10 0.014758371
11 0.381105476
12 0.721386493 4
13 -0.363222383
14 0.201409322
15 0.724867214
16 -1.586829584
17 1.066288451
18 0.182824494
19 0.237447191
20 -0.215475797
Longest positive run: 4 Longest negative run: 2
I am following this tutorial: https://ocw.mit.edu/ans7870/18/18.05/s14/html/r-tut-rle.html
I need to fund the longest run of values >0 and also <0. So is there any way to edit the above?
I guess this only finds longest lengths of 1,0? If that case then i would need a helper column ifelse 1,0 to split the negative , positive... then perhaps find longest length with this code:
> df$wins <- ifelse(df$V2 > 0, 1, 0)
> df$loss <- ifelse(df$V2 < 0, 1, 0)
> win <- (c(df$wins))
> max(rle(win)$lengths)
[1] 4
This works for finding maximum wins...
This is for losses:
> print(df$loss)
[1] 0 1 1 0 1 0 1 1 0 0 0 0 1 0 0 1 0 0 0 1
> df$loss <- ifelse(df$V2 < 0, 1, 0)
> print(df$loss)
[1] 0 1 1 0 1 0 1 1 0 0 0 0 1 0 0 1 0 0 0 1
> loss <- (c(df$loss))
> max(rle(loss)$lengths)
[1] 4
Not sure why it says 4... there is clearly 2 maximum losses, anyone see why? What am i missing, shouldn't the logic of the wins work exactly the same for the loss? I cant see any error in the code...
if value in df$loss is less than 0 print 1 else 0.
make a vector containing the contents of df$loss column
find max length of 1's using max(rle(loss)$lengths)
again result is 4.. however, its clearly 2?
Upvotes: 0
Views: 3223
Reputation: 2986
To get the maximum positive/negative runs in a series ( in your case in a column of a data.frame the rle
function is all you need:
set.seed(123)
df <- data.frame(col1=rnorm(20, mean = 0, sd = 1))
max. pos run:
max(rle(sign(df$col1))[[1]][rle(sign(df$col1))[[2]] == 1])
[1] 5
max. neg.run:
max(rle(sign(df$col1))[[1]][rle(sign(df$col1))[[2]] == -1])
[1] 3
let’s check the results:
> df
col1
1 -0.56047565
2 -0.23017749
3 1.55870831
4 0.07050839
5 0.12928774
6 1.71506499
7 0.46091621
8 -1.26506123
9 -0.68685285
10 -0.44566197
11 1.22408180
12 0.35981383
13 0.40077145
14 0.11068272
15 -0.55584113
16 1.78691314
17 0.49785048
18 -1.96661716
19 0.70135590
20 -0.47279141
Upvotes: 1
Reputation: 16121
I think rle
will do the job based on your objective, but I'll post an alternative solution, with a little bit more data manipulation, but you'll be able to get some more info.
Typically, after answering one question you'll be asked to answer more. For example, get an idea of the distribution of the runs for positives vs. negatives, as the maximum value might not tell you much. Or, perform a statistical comparison to see if positives run longer than the negatives on average.
Also, with ~3000 rows I don't think you'll have any speed issues.
library(dplyr)
# example dataset
dt = data.frame(x = c(1,-1,-2,0,2,4,3,5,-5,-6,-7,0,0))
# get a dataset that assigns an id to all positive or negative series
dt %>%
mutate(sign = ifelse(x > 0, "pos", ifelse(x < 0, "neg", "zero")), # get the sign of the value
sign_lag = lag(sign, default = sign[1]), # get previous value (exception in the first place)
change = ifelse(sign != sign_lag, 1 , 0), # check if there's a change
series_id = cumsum(change)+1) %>% # create the series id
print() -> dt2 # print to screen and save it
# x sign sign_lag change series_id
# 1 1 pos pos 0 1
# 2 -1 neg pos 1 2
# 3 -2 neg neg 0 2
# 4 0 zero neg 1 3
# 5 2 pos zero 1 4
# 6 4 pos pos 0 4
# 7 3 pos pos 0 4
# 8 5 pos pos 0 4
# 9 -5 neg pos 1 5
# 10 -6 neg neg 0 5
# 11 -7 neg neg 0 5
# 12 0 zero neg 1 6
# 13 0 zero zero 0 6
You can remove the help columns sign_lag
and change
as I've included them only to demonstrate how the process works.
# Get longest runs
dt2 %>%
count(sign, series_id) %>%
group_by(sign) %>%
filter(n == max(n)) %>%
select(-series_id) %>%
ungroup
# # A tibble: 3 x 2
# sign n
# <chr> <int>
# 1 neg 3
# 2 pos 4
# 3 zero 2
# Get all runs
dt2 %>% count(sign, series_id)
# # A tibble: 6 x 3
# sign series_id n
# <chr> <dbl> <int>
# 1 neg 2 2
# 2 neg 5 3
# 3 pos 1 1
# 4 pos 4 4
# 5 zero 3 1
# 6 zero 6 2
If you get all runs you will be able to plot distributions of the runs of positives vs. negatives, or perform statistical comparisons to see if the positives run longer on average than the negative.
Upvotes: 1
Reputation: 127
This is a simple manner to do it, I assumed that you began with a dataframe with only the values I also assumed you need a dataframe with the counts of the same digits run a column for the positive and a column for the negative.
set.seed(42)
df=data.frame(x= runif(300, -1.0, 1.0))
count_pos=c()
count_neg=c()
n1=df$x[1]
if (sign(n1)==1){
count_pos[1]=1
count_neg[1]=0
}else{
count_neg[1]=1
count_pos[1]=0
}
count=1
index=1
for (i in df$x[2:nrow(df)]){
#print (i)
index=index+1
if (sign(n1)==sign(i)){
count=count+1
}
else{
count=1
}
if (sign(i)==1){
count_pos[index]=count
count_neg[index]=0
}else{
count_pos[index]=0
count_neg[index]=count
}
n1=i
}
df2=data.frame(x=df$x,count_pos=count_pos,count_neg=count_neg)
#df2 contains the dataframe with columns that count the run for numbers with
#same sign
print (paste("Maximum run of Positive numbers:",max(count_pos)))
print (paste("Maximun run of negative numbers:",max(count_neg)))
I know there is a more compact way to do it but this code makes the trick.
Upvotes: 1