Nile
Nile

Reputation: 333

Is there a better way to match each elements of a vector to a set of defined intervals?

I have a set of defined intervals and I need to categorized each elements of a vector, based on the interval it lies in. Here is an example. Because my data is going to be very large, I wonder if there is a more efficient way to do that.

library(dplyr)
Intrevals <- data.frame(start = c( 3, 6, 10, 58, 78), end = c(4, 8, 46, 68, 84 ), catg = c("F", "R", "O", "S", "X"))
x = data.frame( unit = c(1:100), value = sample(1:84, 100, replace = TRUE))
E = x %>%
  tidyr::crossing(Intrevals) %>%
  dplyr::mutate(flag = ifelse((value >= start) & (value < end), 1, 0)) %>%
  dplyr::filter(flag == 1)

Thanks for any suggestions.

Upvotes: 2

Views: 66

Answers (2)

chinsoon12
chinsoon12

Reputation: 25225

Here are 4 options using data.table:

First, set some keys for Intervals and DT for options 1 - 3:

setDT(Intervals, key=c("start","end"))
DT <- as.data.table(x, key="value")

1) using non-equi join and by=.EACHI:

DT[, catg1 := Intervals[DT, on=.(start<=value, end>=value), by=.EACHI, catg]$catg]

2) using 2 rolling joins to find catg based on start or end and is within interval only when both are the same

DT[, catg2 := {
    s <- Intervals[DT, on=.(start=value), roll=Inf, mult="first", catg]
    e <- Intervals[DT, on=.(end=value), roll=-Inf, mult="last", catg]
    fifelse(s==e, s, NA_character_)
}]

3) Also rolling joins but flipping the left and right tables as one is shorter than the other and might be faster:

Intervals[, sr := DT[.SD, on=.(value=start), roll=-Inf, mult="first", which=TRUE]]
Intervals[, er := DT[.SD, on=.(value=end), roll=Inf, mult="last", which=TRUE]]
ix <- Intervals[, .(lst=seq(er ,sr), values=rep(catg, er - sr + 1L)), 1L:nrow(Intervals)]
DT[, catg3 := replace(rep(NA_character_, .N), ix$lst, ix$values)]

4) using foverlaps:

DT2 <- setkey(as.data.table(x)[, c("start", "end") := .(value, value)], start, end)
DT[, catg4 := foverlaps(DT2, Intervals)$catg]   #using DT and not DT2 for checking correctness

head of output:

     unit value catg catg2 catg3 catg4
  1:    8     1 <NA>  <NA>  <NA>  <NA>
  2:   82     1 <NA>  <NA>  <NA>  <NA>
  3:  100     1 <NA>  <NA>  <NA>  <NA>
  4:   49     2 <NA>  <NA>  <NA>  <NA>
  5:    1     3    F     F     F     F
  6:    2     3    F     F     F     F
  7:    3     4    F     F     F     F
  8:    4     4    F     F     F     F
  9:   45     6    R     R     R     R
 10:   18     7    R     R     R     R
 11:   81    10    O     O     O     O
 12:   59    13    O     O     O     O
 13:   65    13    O     O     O     O

data:

library(data.table)
set.seed(0L)
Intervals <- data.frame(start = c( 3, 6, 10, 58, 78), end = c(4, 8, 46, 68, 84 ), 
    catg = c("F", "R", "O", "S", "X"))
x = data.frame( unit = c(1:100), value = c(3,3,4,4,sample(1:84, 96, replace = TRUE)))

Would be interested to know the performance of each option.

Upvotes: 1

G5W
G5W

Reputation: 37641

You can do this in base R using cut. Since your sample statement generated values that were out of range, I modified it so that all values are within one of the ranges.

Intervals <- data.frame(start = c( 3, 6, 10, 58, 78), 
    end = c(4, 8, 46, 68, 84 ), 
    catg = c("F", "R", "O", "S", "X"))
x = data.frame( unit = c(1:100), value = sample(3:84, 100, replace = TRUE))
B = c(Intervals$start[1], Intervals$end)
cut(x$value, breaks=B, labels=Intervals$catg, include.lowest=T)

  [1] X O O O X S O O O O S X X O O X O O O O S O O O O O O S X R X O R S O S S S O
 [40] X O X R S O S S F O X X O S S F S O O X O S O S O O O O F S O O O O O O O S X
 [79] S O O S X X X F O X X O O R O X O O O X X S
Levels: F R O S X

Upvotes: 1

Related Questions