Reputation: 85
I have a data frame with 2718 observations, of which one column is of interest. It is a first difference series created with diff(). For ease, let's create a fake vector that resembles the data and pretend v is a first difference series. The NAs are introduced to make it similar to the original data.
# Create fake first difference series vector v
v <- runif(2718, -0.05, 0.05)
v <- append(NA, diff(v))
# Insert NAs at the beginning and end
v[c(1:8, 2712:2718)] <- NA
# Insert some NAs at random places in v
ind <- which(v %in% sample(v, 7))
v[ind] <- NA
I am interested in the sequences of v that show an increasing and decreasing behaviour. Specifically, I would like to extract the sequences of v that consecutively increase and decrease, respectively. In an increasing sequence, each element of v cannot be less than its preceding element, and in a decreasing sequence, each element of v cannot be greater than its preceding element. Try to picture this when plotting v: whenever the curve does not decrease (i.e. goes up or stays level), it is an increasing sequence, and whenever the curve does not increase (i.e. goes down or stays level), it is a decreasing sequence.
To clarify, the procedure may be explained by this:
- Take each element i in v and compare it to the preceding one, i-1.
- If i is greater than or equal to i-1, the sequence qualifies as increasing; if i is less than or equal to i-1, the sequence qualifies as decreasing.
- Extract the ith element.
- If there is no change from i-1 to i (i.e. i-1 and i are equal), the sequence continues, just as it does when a NA occurs.
As v is a first difference series, the extracted element i (3rd bullet point) already represents the increase/decrease. For now, I do not want to limit the length of the sequences, hence a sequence might already be given by two elements.
I imagine storing the ith elements of v in a new vector (e.g. inc.v and dec.v) and afterwards finding the maximum and mean increase/decrease of the sequences, as well as the maximum and mean length of these sequences. The elements should be stored in inc.v or dec.v with respect to their original positions in v, so I can track them back. Each sequence in inc.v and dec.v should be easy to distinguish when they are separated by NA elements.
I tried writing this with a for loop and conditional statements but did not get far:
inc.v <- NULL
dec.v <- NULL
for (i in 2:length(v)) {
  if(!v[i] < v[i-1] | is.na(v[i])) {
    inc.v[i] <- v[i]
  } else if (!v[i] > v[i-1] | is.na(v[i])) {
    dec.v[i] <- v[i]
  } else {
    next
  }
}
The if and else if statements represent the fifth bullet point. I am aware of the problem that when i equals i-1, it can qualify as both an increasing and a decreasing sequence, and it should be added to whatever sequence was stored previously. I just have no idea how to implement that. I think the sequences will be quite short, as the data is noisy and periods of no decrease/no increase won't prevail for long. Hence, it might be a good idea to also try this operation with e.g. a 50/100 points moving mean:
# A symmetric 50 points moving average for v
f50 <- rep(1/51,51)
v_smooth <- filter(v, f50, sides = 2)
When running the loop as of now, the evaluation of the first condition results in an NA, giving me the error:
Error in if (!v[i] < v[i - 1] | is.na(v[i])) { :
missing value where TRUE/FALSE needed
I do not quite understand what's happening here, because the is.na() statement should secure a TRUE or FALSE result?!
Happy to hear your thoughts!
Upvotes: 1
Views: 892
Reputation: 5673
You should vectorize instead of looping, and use direct conditions on your difference vector to create new columns that contain your inc and dec values. It works the same way when you want to smooth. Here is an example:
library(data.table)
plouf <- setDT(list( v = v, diff = c(NA,diff(v))))
plouf[diff > 0,inc := v]
plouf[diff < 0, dec := v]
f50 <- rep(1/51,51)
plouf[,v_smooth := filter(v, f50, sides = 2)]
plouf[,diff_smooth :=c(NA,diff(v_smooth))]
plouf[diff_smooth > 0,inc_smooth := v_smooth]
plouf[diff_smooth < 0, dec_smooth := v_smooth]
To extract the size of each decrease, you need to create a grouping variable that increases at each change in the sign of diff, so that you can perform any operation on each increasing or decreasing sequence using by:
plouf[,grouptmp := abs(c(NA,diff(ifelse(diff>0,1,0))))]
plouf[is.na(grouptmp),grouptmp:= 0]
plouf[,group := cumsum(grouptmp)]
plouf[,decvalue := dec[.N] - dec[1], by = group]
plouf[,incvalue := inc[.N]-inc[1], by = group]
                v          diff           inc           dec group     decvalue grouptmp
 1:            NA            NA            NA            NA     0           NA        0
 2:            NA            NA            NA            NA     0           NA        0
 3:            NA            NA            NA            NA     0           NA        0
 4:            NA            NA            NA            NA     0           NA        0
 5:            NA            NA            NA            NA     0           NA        0
 6:            NA            NA            NA            NA     0           NA        0
 7:            NA            NA            NA            NA     0           NA        0
 8:            NA            NA            NA            NA     0           NA        0
 9: -0.0344851657            NA            NA            NA     0           NA        0
10:  0.0788633499  0.1133485156  0.0788633499            NA     0           NA        0
11: -0.0415118591 -0.1203752090            NA -0.0415118591     1  0.000000000        1
12:  0.0557818390  0.0972936981  0.0557818390            NA     2           NA        1
13: -0.0314433977 -0.0872252367            NA -0.0314433977     3  0.000000000        1
14:  0.0098391432  0.0412825409  0.0098391432            NA     4           NA        1
15: -0.0147885296 -0.0246276728            NA -0.0147885296     5  0.000000000        1
16: -0.0009157661  0.0138727635 -0.0009157661            NA     6           NA        1
17:  0.0303060166  0.0312217827  0.0303060166            NA     6           NA        0
18: -0.0384165912 -0.0687226078            NA -0.0384165912     7 -0.005185349        1
19: -0.0436019399 -0.0051853487            NA -0.0436019399     7 -0.005185349        0
20:  0.0821260908  0.1257280307  0.0821260908            NA     8           NA        1
21: -0.0172987636 -0.0994248545            NA -0.0172987636     9 -0.003255037        1
22: -0.0205538005 -0.0032550369            NA -0.0205538005     9 -0.003255037        0
23: -0.0114417208  0.0091120797 -0.0114417208            NA    10           NA        1
24:  0.0524503477  0.0638920686  0.0524503477            NA    10           NA        0
25: -0.0105871856 -0.0630375333            NA -0.0105871856    11 -0.047042624        1
26: -0.0576298093 -0.0470426237            NA -0.0576298093    11 -0.047042624        0
27:  0.0031608195  0.0607906288  0.0031608195            NA    12           NA        1
28: -0.0009828784 -0.0041436979            NA -0.0009828784    13  0.000000000        1
29:  0.0167153471  0.0176982255  0.0167153471            NA    14           NA        1
30:  0.0088964230 -0.0078189241            NA  0.0088964230    15 -0.033234568        1
31:  0.0065035882 -0.0023928348            NA  0.0065035882    15 -0.033234568        0
32: -0.0243381450 -0.0308417332            NA -0.0243381450    15 -0.033234568        0
You can then easily find the greatest increase or decrease, or compute whatever else you need per group.
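For illustration, here is a rough base R sketch of that last step on a small made-up series. The vector x and all variable names are hypothetical, and the sketch assumes there are no NAs and no ties; it is not a drop-in replacement for the data.table code above.

```r
# Hypothetical toy series (not the question's data), no NAs and no ties
x <- c(0.10, 0.30, 0.20, 0.15, 0.40, 0.50)
d <- diff(x)

# One id per run of same-signed differences: the id goes up by one
# every time the direction flips
up  <- d > 0
grp <- cumsum(c(TRUE, up[-1] != up[-length(up)]))

# Net change, length and direction of each run
run_change <- tapply(d, grp, sum)
run_length <- tapply(d, grp, length)
run_up     <- tapply(up, grp, all)

# Largest rise and largest fall over all runs
largest_rise <- max(run_change[run_up])
largest_fall <- min(run_change[!run_up])

print(data.frame(change = as.numeric(run_change),
                 length = as.integer(run_length),
                 up     = as.logical(run_up)))
```

With real data, the NAs would have to be dropped or treated as their own runs before the cumsum() step.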
Upvotes: 2
Reputation: 1255
Here is an attempt to answer your question (note that I slightly changed your example)
# Create fake first difference series vector v
v <- runif(2718, -0.05, 0.05)
v <- append(NA, diff(v))
# Insert NAs at the beginning and end
v[c(1:8, 2712:2718)] <- NA
# Insert some NAs at random places in v
v[sample(1:length(v), 7)] <- NA
# a couple of equal values
v[10:15] <- 1
# create an empty vector of character
out <- character(length(v)-1)
tmp <- diff(v)
# known increase
out[tmp>0] <- "I"
# known decrease
out[tmp<0] <- "D"
# no change
out[tmp == 0] <- "E"
# known NA
out[is.na(tmp)] <- NA
# replace E with the right value (I or D); if there is no way to know, use I by default
for (i in 1:length(out)) {
  if (!is.na(out[i]) & out[i] == "E") {
    if (i==1) {
      out[i] <- "I"
    } else {
      if (is.na(out[i-1])) {
        out[i] <- "I"
      } else out[i] <- out[i-1]
    }
  }
}
# Retrieve values
dec.v <- inc.v <- rep(NA_real_, length(v))
idi <- which(out == "I")+1
inc.v[idi] <- v[idi]
idd <- which(out == "D")+1
dec.v[idd] <- v[idd]
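Once inc.v and dec.v are filled, the lengths of the sequences can probably be read off with rle(), since every run of non-NA values is one sequence. A minimal sketch on a made-up vector (inc_demo is a hypothetical stand-in for inc.v, and the other names are also just for illustration):

```r
# Hypothetical stand-in for inc.v: two sequences separated by NAs
inc_demo <- c(NA, 0.02, 0.03, NA, NA, 0.01, 0.04, 0.05, NA)

# Runs of non-NA values are the sequences
r <- rle(!is.na(inc_demo))
seq_len_inc <- r$lengths[r$values]

max_len  <- max(seq_len_inc)
mean_len <- mean(seq_len_inc)

# Give each sequence an id so per-sequence statistics can be computed;
# positions that are NA in inc_demo get no id
id <- rep(cumsum(r$values), r$lengths)
id[is.na(inc_demo)] <- NA
max_per_seq <- tapply(inc_demo, id, max)

print(c(max_len = max_len, mean_len = mean_len))
print(max_per_seq)
```

The same lines should work for dec.v, with min() instead of max() if the largest decrease is wanted.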
Also, regarding the error in your loop: put the is.na() checks first and combine them with ||, so that the comparison is never evaluated on a missing value. Note that the comparison also returns NA when v[i-1] is NA, so both v[i] and v[i-1] need to be checked.
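A quick way to see where the NA comes from is to evaluate the test by hand on a tiny vector. The sketch below uses a made-up vector x (not from the question) in which the preceding element is missing but the current one is not.

```r
# Hypothetical toy vector: the element before position 2 is missing
x <- c(NA, 0.2, 0.1)

# The question's test at i = 2: the comparison is NA, is.na(x[2]) is FALSE,
# and NA | FALSE stays NA
cond_orig <- !x[2] < x[1] | is.na(x[2])

# Guarding both elements first: || short-circuits, so the comparison
# is only evaluated when neither element is missing
cond_safe <- is.na(x[2]) || is.na(x[1]) || !(x[2] < x[1])

print(cond_orig)  # NA, which if() cannot handle
print(cond_safe)  # TRUE
```

Whether a missing predecessor should count as continuing the sequence (TRUE here) is a design choice; the question says sequences continue across NAs, which is what this guard assumes.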
Hope this helps :)
Upvotes: 1
Reputation: 2502
You should really try a vectorized approach. This is probably a clearer way to find runs of increasing or decreasing sequences:
library(data.table)
data <- as.data.table(v)
data[, vl := shift(v, 1)]
data[, runs := rleid(vl > v)]
This uses the data.table library.
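As a rough illustration of what such a run id does, here is a base R sketch on a small made-up series. It is only a stand-in built on rle(), not the data.table implementation, and x and the labels are hypothetical.

```r
# Hypothetical toy series with one tie (0.3, 0.3)
x <- c(0.1, 0.3, 0.3, 0.2, 0.1, 0.4)

# TRUE where the series goes down relative to the previous element;
# the first element has no predecessor, so it is NA
down <- c(NA, diff(x) < 0)

# Turn the logical (with its NA) into labels, then number the runs:
# rle() gives the run lengths and rep() assigns one id per run
lab  <- ifelse(is.na(down), "none", ifelse(down, "down", "not down"))
r    <- rle(lab)
runs <- rep(seq_along(r$lengths), r$lengths)

print(data.frame(x, down, runs))
```

Note that a tie lands in the "not down" run here, which matches the question's definition of an increasing sequence but does not attach the tie to whichever sequence came before it.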
Upvotes: 1