Caleb

Reputation: 53

How to drop observations whose inter-row difference is less than a specific value

I have a data.table consisting of several groups (a hierarchical panel/longitudinal dataset, to be more specific), and one group looks like this

z <- data.table(x = c(10, 10.5, 11.1, 14, 14.2, 14.4, 14.6, 17, 17.4, 30), 
            t = as.Date(c(27, 32:34, 36:41))) 
# that is:
#        x          t
#  1: 10.0 1970-01-28
#  2: 10.5 1970-02-02
#  3: 11.1 1970-02-03
#  4: 14.0 1970-02-04
#  5: 14.2 1970-02-06 # to be removed since 14.2-14.0 = 0.2 <0.5
#  6: 14.4 1970-02-07 # to be removed since 14.4-14.2 = 0.2 <0.5 and 14.4-14.0 = 0.4 <0.5
#  7: 14.6 1970-02-08 # shall NOT be removed because 14.6-14.0 = 0.6 > 0.5
#  8: 17.0 1970-02-09
#  9: 17.4 1970-02-10 # to be removed
# 10: 30.0 1970-02-11

For simplicity, the groups are excluded here, so just assume there are only two variables (columns) in the data.

I need to drop the observations with inter-row differences less than 0.5 between any two neighboring rows, so what I need would look like this

#        x          t
#  1: 10.0 1970-01-28
#  2: 10.5 1970-02-02
#  3: 11.1 1970-02-03
#  4: 14.0 1970-02-04
#  7: 14.6 1970-02-08
#  8: 17.0 1970-02-09
# 10: 30.0 1970-02-11

In the end, any two neighboring values (in the order of the variable t) should differ by no less than 0.5.
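In base R, the rule I am after could be sketched like this (a running "last kept" value is carried forward, and a row survives only if it differs from it by at least 0.5):

```r
x <- c(10, 10.5, 11.1, 14, 14.2, 14.4, 14.6, 17, 17.4, 30)

# Running "last kept" value: it only changes when a row is kept
last_kept <- Reduce(function(last, xi) if (abs(xi - last) >= 0.5) xi else last,
                    x, accumulate = TRUE)

# A row is kept exactly when the running value changes (plus the first row)
keep <- c(TRUE, diff(last_kept) != 0)
which(keep)
# 1 2 3 4 7 8 10
```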

Is this possible for a data.table like this, but much larger, with several groups and nearly 100 million observations?

Thank you in advance!

Upvotes: 1

Views: 299

Answers (3)

arg0naut91

Reputation: 14764

If I understood correctly, you could do:

library(data.table)

z <- z[, filt := min(x), by = cumsum(c(1, +(x >= shift(x) + 0.5)[-1]))][
  , filt := ifelse(x == filt, 
                   shift(x, fill = x[1]), 
                   filt)][
                     x - filt >= 0.5 | x == filt, ][, filt := NULL]

Explanation:

  • First we compute the minimum of x within each group;
  • The groups are created by cumsum(c(1, +(x >= shift(x) + 0.5)[-1])). For each row we check whether x >= shift(x) + 0.5, i.e. whether the difference from the previous row is at least 0.5. The + turns the resulting TRUE/FALSE into 1/0. The first element is always NA (there is no previous row to compare against), so we drop it with [-1] and prepend a 1 instead. cumsum then increments the group id every time a row satisfies the condition and repeats the previous id for the rows in between, so each group starts with a row that should be kept;
  • Some of these groups contain only a single row; for those we cross-check the difference against the exact previous row. In all other cases we cross-check against the first row of the group (i.e. the last row that should not be deleted, as it was at least 0.5 larger than its predecessor);
  • Finally we keep the rows that satisfy the condition, plus the row that equals its own filt value (always the first one of a group), and remove the filtering variable at the end.
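To see what the grouping expression produces, here is the same step on the example vector in plain base R (shift() replaced by a manual lag; the printed group ids are my own illustration, not part of the answer's code):

```r
x <- c(10, 10.5, 11.1, 14, 14.2, 14.4, 14.6, 17, 17.4, 30)

lag_x <- c(NA, x[-length(x)])                   # base-R stand-in for shift(x)
grp   <- cumsum(c(1, (x >= lag_x + 0.5)[-1]))   # new group whenever the jump is >= 0.5
grp
# 1 2 3 4 4 4 4 5 5 6

# filt as computed in the first step: the group-wise minimum of x
filt <- ave(x, grp, FUN = min)
filt
# 10.0 10.5 11.1 14.0 14.0 14.0 14.0 17.0 17.0 30.0
```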

Output:

      x          t
1: 10.0 1970-01-28
2: 10.5 1970-02-02
3: 11.1 1970-02-03
4: 14.0 1970-02-04
5: 14.6 1970-02-08
6: 17.0 1970-02-09
7: 30.0 1970-02-11

Upvotes: 2

Soren

Reputation: 2425

As each gap depends on the sequential removal of earlier rows, the solution below uses an iterative approach: after a row is removed, the subsequent gaps are re-calculated.

z <- data.table(x = c(10, 10.5, 11.1, 14, 14.2, 14.4, 14.6, 17, 17.4, 30), 
                t = as.Date(c(27, 32:34, 36:41))) 
setkeyv(z,"t")

find_gaps <- function(dt) {
  dt[, last_x := shift(.SD, n=1, fill=NA, type="lag"), .SDcols="x"]
  gaps <- dt[,abs(x-last_x) < 0.5,]
  gap <- which(gaps==TRUE)[1]
  print(paste0("Removing row: ", gap))
  return (gap)
}

while(!is.na(gap<-find_gaps(z))) { z <- z[-gap] }

z

Results:

[1] "Removing row: 5"
[1] "Removing row: 5"
[1] "Removing row: 7"
> z
      x          t last_x
1: 10.0 1970-01-28     NA
2: 10.5 1970-02-02   10.0
3: 11.1 1970-02-03   10.5
4: 14.0 1970-02-04   11.1
5: 14.6 1970-02-08   14.0
6: 17.0 1970-02-09   14.6
7: 30.0 1970-02-11   17.0

Alternate

Noting the 8 GB of data and with an eye for efficiency, proposing a good old for loop as the most efficient option:

z1 <- data.table(x = c(10, 10.5, 11.1, 14, 14.2, 14.4, 14.6, 17, 17.4, 30), t = as.Date(c(27, 32:34, 36:41))) ; setkeyv(z1,"t")
x <- z1$x
last_x <- x[1]
gaps <- c()

for (i in 2:length(x))
{
  if (abs(x[i]-last_x) < 0.5) gaps <- c(gaps,i)
  else last_x <- x[i]
}
if (length(gaps)) z1 <- z1[-gaps]  # guard: gaps is NULL when nothing is removed
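Since the question mentions several groups, the same loop can be wrapped in a function and applied per group; `id` below is a hypothetical grouping column, not one from the question's data:

```r
# Returns a logical keep-vector implementing the sequential rule:
# a row is kept only if it differs from the last kept value by >= gap
keep_idx <- function(x, gap = 0.5) {
  keep <- logical(length(x))
  keep[1] <- TRUE
  last_x <- x[1]
  for (i in seq_along(x)[-1]) {
    if (abs(x[i] - last_x) >= gap) {
      keep[i] <- TRUE
      last_x  <- x[i]
    }
  }
  keep
}

# Single group:    z1[keep_idx(x)]
# Several groups:  z1[, .SD[keep_idx(x)], by = id]
```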

Benchmarking

microbenchmark::microbenchmark(times=100,
  forway={
    z1 <- data.table(x = c(10, 10.5, 11.1, 14, 14.2, 14.4, 14.6, 17, 17.4, 30), t = as.Date(c(27, 32:34, 36:41))) ; setkeyv(z1,"t")
    x <- z1$x; last_x <- x[1];  gaps <- c()

    for (i in 2:length(x)) { if (abs(x[i]-last_x) < 0.5) { gaps <- c(gaps,i); } else { last_x <- x[i]; } }
    if (length(gaps)) z1 <- z1[-gaps]
  },
  datatableway={
    z2 <- data.table(x = c(10, 10.5, 11.1, 14, 14.2, 14.4, 14.6, 17, 17.4, 30), t = as.Date(c(27, 32:34, 36:41))) ; setkeyv(z2,"t")

    z2 <- z2[, filt := min(x), by = cumsum(c(1, +(x >= shift(x) + 0.5)[-1]))][, filt := ifelse(x == filt, shift(x, fill = x[1]), filt)][x - filt >= 0.5 | x == filt, ][, filt := NULL]
  },
  whileway={
    z3 <- data.table(x = c(10, 10.5, 11.1, 14, 14.2, 14.4, 14.6, 17, 17.4, 30), t = as.Date(c(27, 32:34, 36:41))) ; setkeyv(z3,"t")

    find_gaps <- function(dt) {
      dt[, last_x := shift(.SD, n=1, fill=NA, type="lag"), .SDcols="x"]
      gaps <- dt[,abs(x-last_x) < 0.5,]
      which(gaps==TRUE)[1]
    }
    while(!is.na(gap<-find_gaps(z3))) { z3 <- z3[-gap] }
  }
)

(z1==z2) & (z2==z3[,.(x,t)])

Results:

Unit: milliseconds
         expr       min        lq      mean    median        uq      max neval
       forway  2.741609  3.607341  4.067566  4.069382  4.556219  5.61997   100
 datatableway  7.552005  8.915333  9.839475  9.606205 10.762764 15.46430   100
     whileway 13.903507 19.059612 20.692397 20.577014 22.243933 27.44271   100
> 
> (z1==z2) & (z2==z3[,.(x,t)])
        x    t
[1,] TRUE TRUE
[2,] TRUE TRUE
[3,] TRUE TRUE
[4,] TRUE TRUE
[5,] TRUE TRUE
[6,] TRUE TRUE
[7,] TRUE TRUE

Upvotes: 1

Sonny

Reputation: 3183

You can use dplyr::mutate and filter:

library(dplyr)

z %>%
  mutate(diff = lead(x, 1) - x) %>%
  filter(diff >= 0.5 | is.na(diff)) %>%
  select(-diff)

I kept the diff column for easier understanding. You can also do this in a single filter statement.
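The single filter statement mentioned above could look like this (my own condensation of the answer's code, keeping its lead()-based criterion):

```r
library(dplyr)

z %>%
  filter(is.na(lead(x)) | lead(x) - x >= 0.5)
```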

Upvotes: 0
