Jaap
Jaap

Reputation: 83215

How to identify rows within a certain boundary of a condition and calculate the differences?

Suppose I have the following dataset:

> dat
       value id
 1: 2.785300  1
 2: 4.164371  1
 3: 4.179532  1
 4: 4.373546  1
 5: 4.378759  1
 6: 4.694612  1
 7: 4.955066  2
 8: 4.983810  2
 9: 5.183643  2
10: 5.329508  2
11: 5.389843  2
12: 5.487429  2
13: 5.575781  3
14: 5.738325  3
15: 5.943836  3
16: 6.124931  3
17: 6.511781  3
18: 6.595281  3

which can be created with:

set.seed(1)
library(data.table)
dat <- data.table(value=round(rnorm(18, mean = 5, sd = 1),6))[order(value)][,id:=rep(1:3, each=6)]

Next thing I did is creating a variable which indicates in which row a new id occurs first and give that row a specific changeid with all other rows having NA values with:

dat[, changeid:=ifelse(+(!(id==shift(id,n=1L,type="lag")))==1,1,NA)
    ][, changeid:=rleid(changeid)[changeid==1]]

which gives:

> dat
       value id changeid
 1: 2.785300  1       NA
 2: 4.164371  1       NA
 3: 4.179532  1       NA
 4: 4.373546  1       NA
 5: 4.378759  1       NA
 6: 4.694612  1       NA
 7: 4.955066  2        2
 8: 4.983810  2       NA
 9: 5.183643  2       NA
10: 5.329508  2       NA
11: 5.389843  2       NA
12: 5.487429  2       NA
13: 5.575781  3        4
14: 5.738325  3       NA
15: 5.943836  3       NA
16: 6.124931  3       NA
17: 6.511781  3       NA
18: 6.595281  3       NA

Now I want to creat two new variables:

  1. A variable called window which has the same changeid value for all rows within a certain boundary of value (e.g.: within 0.2 of the row which has a changeid value). The other rows have NA's.
  2. A variable called iddif with the difference with the value for row which has a changeid. The other rows have NA's.

The desired result:

> dat
       value id changeid window      iddif
 1: 2.785300  1       NA     NA         NA
 2: 4.164371  1       NA     NA         NA
 3: 4.179532  1       NA     NA         NA
 4: 4.373546  1       NA     NA         NA
 5: 4.378759  1       NA     NA         NA
 6: 4.694612  1       NA     NA         NA
 7: 4.955066  2        2      2   0.000000
 8: 4.983810  2       NA      2   0.028744
 9: 5.183643  2       NA     NA         NA
10: 5.329508  2       NA     NA         NA
11: 5.389843  2       NA      4  -0.185938
12: 5.487429  2       NA      4  -0.088352
13: 5.575781  3        4      4   0.000000
14: 5.738325  3       NA      4   0.162544
15: 5.943836  3       NA     NA         NA
16: 6.124931  3       NA     NA         NA
17: 6.511781  3       NA     NA         NA
18: 6.595281  3       NA     NA         NA

Any ideas how to get to this desired result?

Bonus question: How do get the changeid to start at 1 and then increment with 1 for the next occurance of an id-change?

It would be nice if the solution(s) make use of data.table as well.

Upvotes: 3

Views: 167

Answers (1)

David Arenburg
David Arenburg

Reputation: 92282

Here's a possible solution using foverlaps

First, I would create changeid as follows (as per the bonus)

dat[c(0L, diff(id)) == 1L, changeid := 1:.N]

Then, I would create a temp data set where changeid isn't NA, key it, create interval columns in dat and run foverlaps over them. Then, extract matched rows and update the original data

temp <- dat[!is.na(changeid), .(start = value, end = value)] # temp data
dat[, `:=`(start = value - 0.2, end = value + 0.2)] # set boundries
setkey(temp) # key the smaller data
res <- foverlaps(dat, temp, which = TRUE, nomatch = 0L) # get matched incidents
dat[res$xid, `:=`(window = res$yid, iddif = temp$start[res$yid])] # update values
dat[!is.na(window), iddif := value - iddif] # calculate differences
dat
#        value id changeid    start      end window     iddif
#  1: 2.785300  1       NA 2.585300 2.985300     NA        NA
#  2: 4.164371  1       NA 3.964371 4.364371     NA        NA
#  3: 4.179532  1       NA 3.979532 4.379532     NA        NA
#  4: 4.373546  1       NA 4.173546 4.573546     NA        NA
#  5: 4.378759  1       NA 4.178759 4.578759     NA        NA
#  6: 4.694612  1       NA 4.494612 4.894612     NA        NA
#  7: 4.955066  2        1 4.755066 5.155066      1  0.000000
#  8: 4.983810  2       NA 4.783810 5.183810      1  0.028744
#  9: 5.183643  2       NA 4.983643 5.383643     NA        NA
# 10: 5.329508  2       NA 5.129508 5.529508     NA        NA
# 11: 5.389843  2       NA 5.189843 5.589843      2 -0.185938
# 12: 5.487429  2       NA 5.287429 5.687429      2 -0.088352
# 13: 5.575781  3        2 5.375781 5.775781      2  0.000000
# 14: 5.738325  3       NA 5.538325 5.938325      2  0.162544
# 15: 5.943836  3       NA 5.743836 6.143836     NA        NA
# 16: 6.124931  3       NA 5.924931 6.324931     NA        NA
# 17: 6.511781  3       NA 6.311781 6.711781     NA        NA
# 18: 6.595281  3       NA 6.395281 6.795281     NA        NA

(You can remove start and end if don't like them)

Upvotes: 4

Related Questions