Reputation: 751
I need to find stretches of values above 0 in a numeric vector where there are at least 10 members within each region. I do not want to check every single position, as that would be very time intensive (the vector has over 10 million elements).
Here is what I'm trying to do (very preliminary as I can't figure out how to skip increments in for loop):
1. Check if x[i] (start position) is positive.
   a) If positive, check whether x[i+10] (end position) is positive (since we want a stretch of at least 10 positive values).
      * If positive, check every position in between to see if it is positive.
      * If negative, skip the positions in between and make x[i+11] the new start position, since a stretch that includes the negative end position cannot reach 10 members.
x <- rnorm(50, mean=0, sd=4)
for (i in 1:(length(x) - 10)) {      # stop early so x[i+10] stays inside the vector
  if (x[i] > 0) {                    # IF START POSITION IS POSITIVE
    flag <- 1
    print(paste0(i, ": start greater than 0"))
    if (x[i+10] > 0) {               # IF END POSITION POSITIVE, THEN CHECK ALL POSITIONS IN BETWEEN
      for (j in (i+1):(i+9)) {       # parentheses needed: i+1:i+9 parses as i + (1:i) + 9
        if (x[j] > 0) {              # IF POSITION IS POSITIVE, CHECK NEXT POSITION IF POSITIVE
          print(paste0(j, ": for j1"))
        } else {                     # IF POSITION IS NEGATIVE, THEN SKIP CHECKING & SET NEW START POSITION
          print(paste0(j, ": for j2"))
          i <- i + 11
          break
        }
      }
    } else {                         # IF END POSITION IS NOT POSITIVE, START CHECK ONE POSITION AFTER END POSITION
      i <- i + 11
    }
  }
}
The issue I have is that even when I manually increment i, the for loop's own value of i overrides the newly set value on the next iteration. I'd appreciate any insight.
Upvotes: 1
Views: 502
Reputation: 4686
Vectorised solution using only basic commands:
x <- runif(1e7,-1,1) # generate random vector
y <- c(0, which(x<=0), length(x)+1) # boundaries (negatives and zeros), padded so runs touching either end of x are found
dif <- diff(y) # distance between consecutive boundaries
drange <- which(dif > 10) # gaps that hold at least 10 positive values
starts <- y[drange]+1 # starting positions of sequence
ends <- y[drange+1]-1 # last positions of sequence
The first range you want is from x[starts[1]] to x[ends[1]], etc.
Upvotes: 1
Reputation: 21502
I dunno if this approach is as efficient as Curt F's, but how about
runs <- rle(x>0)
And then working with the regions defined by runs$lengths >= 10 & runs$values == TRUE?
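To make that concrete, here is a minimal sketch of turning the rle() output into start and end positions. The toy vector, the min_len threshold (taken as "at least 10" from the question) and the regions data frame are my own illustration, not something from the question.

```r
# Toy vector (made up): positive runs of length 3, 10 and 12
x <- c(rep(1, 3), -1, rep(2, 10), 0, -3, rep(0.5, 12))
min_len <- 10                                # assumed threshold: "at least 10"

runs <- rle(x > 0)                           # run lengths of TRUE / FALSE
run_ends   <- cumsum(runs$lengths)           # last index of every run
run_starts <- run_ends - runs$lengths + 1    # first index of every run

keep <- runs$values & runs$lengths >= min_len  # positive runs that are long enough
regions <- data.frame(start  = run_starts[keep],
                      end    = run_ends[keep],
                      length = runs$lengths[keep])
print(regions)
```

On the toy vector this should keep the runs at positions 5 to 14 and 17 to 28 and drop the short run at the start. Each kept region is then x[regions$start[k]:regions$end[k]].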
Upvotes: 2
Reputation: 4824
Here is a solution that finds stretches of ten positive numbers in a vector of length ten million. It does not use the loop approach suggested in the OP.
The idea here is to take the cumulative sum of the logical expression vec>0. The difference between the cumulative sum at position n and at position n-10 will be 10 only if all ten values at positions n-9 through n are positive. filter is an easy and relatively fast way to calculate these differences.
#generate random data
vec <- runif(1e7,-1,1)
#cumulative sum
csvec <- cumsum(vec>0)
#construct a filter that will find the difference between the nth value with the n-10th value of the cumulative sign vector
f11 <- c(1,rep(0,9),-1)
#apply the filter
fv <- filter(csvec, f11, sides = 1)
#find where the difference as computed by the filter is 10
inds <- which(fv == 10)
#check a few results
> vec[(inds[1]-9):(inds[1])]
[1] 0.98457526 0.03659257 0.77507743 0.69223183 0.70776891 0.34305865 0.90249491 0.93019927 0.18686722 0.69973176
> vec[(inds[2]-9):(inds[2])]
[1] 0.0623790 0.8489058 0.3783840 0.8781701 0.6193165 0.6202030 0.3160442 0.3859175 0.8416434 0.8994019
> vec[(inds[200]-9):(inds[200])]
[1] 0.0605163 0.7921233 0.3879834 0.6393018 0.2327136 0.3622615 0.1981222 0.8410318 0.3582605 0.6530633
#check all the results
> prod(sapply(1:length(inds),function(x){prod(sign(vec[(inds[x]-9):(inds[x])]))}))
[1] 1
I played around with system.time() to see how long the various steps took. On my not-very-powerful laptop the longest step was filter(), which took just over half a second for a vector of length ten million.
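For anyone who wants to see the cumulative-sum idea on something small enough to check by eye, here is a rough sketch that takes the same differences with plain indexing instead of filter(). The toy vector and the names k, cs, win and window_ends are made up for illustration. Padding the cumulative sum with a leading 0 is my own tweak, so that a window starting at position 1 is also counted.

```r
# Toy vector (made up): positive runs of length 3, 10 and 12
vec <- c(rep(1, 3), -1, rep(2, 10), 0, -3, rep(0.5, 12))
k <- 10                                  # window length
n <- length(vec)

cs <- c(0, cumsum(vec > 0))              # cs[m + 1] = number of positives in vec[1:m]

# win[j] = number of positives in the window vec[j:(j + k - 1)]
win <- cs[(k + 1):(n + 1)] - cs[1:(n - k + 1)]

# last position of every window that is positive throughout
window_ends <- which(win == k) + k - 1
print(window_ends)

# sanity check: every reported window really is all positive
all_ok <- all(sapply(window_ends, function(e) all(vec[(e - k + 1):e] > 0)))
print(all_ok)
```

Note that a run longer than ten shows up as several overlapping windows (the run of 12 gives three consecutive end positions), which is also what which(fv == 10) reports above.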
Upvotes: 1