Reputation: 333
I have a set of defined intervals and I need to categorized each elements of a vector, based on the interval it lies in. Here is an example. Because my data is going to be very large, I wonder if there is a more efficient way to do that.
library(dplyr)
Intrevals <- data.frame(start = c( 3, 6, 10, 58, 78), end = c(4, 8, 46, 68, 84 ), catg = c("F", "R", "O", "S", "X"))
x = data.frame( unit = c(1:100), value = sample(1:84, 100, replace = TRUE))
E = x %>%
tidyr::crossing(Intrevals) %>%
dplyr::mutate(flag = ifelse((value >= start) & (value < end), 1, 0)) %>%
dplyr::filter(flag == 1)
Thanks for any suggestions.
Upvotes: 2
Views: 66
Reputation: 25225
Here are 4 options using data.table
:
First, set some keys for Intervals
and DT
for options 1 - 3:
setDT(Intervals, key=c("start","end"))
DT <- as.data.table(x, key="value")
1) using non-equi join and by=.EACHI
:
DT[, catg1 := Intervals[DT, on=.(start<=value, end>=value), by=.EACHI, catg]$catg]
2) using 2 rolling joins to find catg
based on start
or end
and is within interval only when both are the same
DT[, catg2 := {
s <- Intervals[DT, on=.(start=value), roll=Inf, mult="first", catg]
e <- Intervals[DT, on=.(end=value), roll=-Inf, mult="last", catg]
fifelse(s==e, s, NA_character_)
}]
3) Also rolling joins but flipping the left and right tables as one is shorter than the other and might be faster:
Intervals[, sr := DT[.SD, on=.(value=start), roll=-Inf, mult="first", which=TRUE]]
Intervals[, er := DT[.SD, on=.(value=end), roll=Inf, mult="last", which=TRUE]]
ix <- Intervals[, .(lst=seq(er ,sr), values=rep(catg, er - sr + 1L)), 1L:nrow(Intervals)]
DT[, catg3 := replace(rep(NA_character_, .N), ix$lst, ix$values)]
4) using foverlaps
:
DT2 <- setkey(as.data.table(x)[, c("start", "end") := .(value, value)], start, end)
DT[, catg4 := foverlaps(DT2, Intervals)$catg] #using DT and not DT2 for checking correctness
head of output:
unit value catg catg2 catg3 catg4
1: 8 1 <NA> <NA> <NA> <NA>
2: 82 1 <NA> <NA> <NA> <NA>
3: 100 1 <NA> <NA> <NA> <NA>
4: 49 2 <NA> <NA> <NA> <NA>
5: 1 3 F F F F
6: 2 3 F F F F
7: 3 4 F F F F
8: 4 4 F F F F
9: 45 6 R R R R
10: 18 7 R R R R
11: 81 10 O O O O
12: 59 13 O O O O
13: 65 13 O O O O
data:
library(data.table)
set.seed(0L)
Intervals <- data.frame(start = c( 3, 6, 10, 58, 78), end = c(4, 8, 46, 68, 84 ),
catg = c("F", "R", "O", "S", "X"))
x = data.frame( unit = c(1:100), value = c(3,3,4,4,sample(1:84, 96, replace = TRUE)))
Would be interested to know the performance of each option.
Upvotes: 1
Reputation: 37641
You can do this in base R using cut
. Since your sample
statement generated values that were out of range, I modified it so that all values are within one of the ranges.
Intervals <- data.frame(start = c( 3, 6, 10, 58, 78),
end = c(4, 8, 46, 68, 84 ),
catg = c("F", "R", "O", "S", "X"))
x = data.frame( unit = c(1:100), value = sample(3:84, 100, replace = TRUE))
B = c(Intervals$start[1], Intervals$end)
cut(x$value, breaks=B, labels=Intervals$catg, include.lowest=T)
[1] X O O O X S O O O O S X X O O X O O O O S O O O O O O S X R X O R S O S S S O
[40] X O X R S O S S F O X X O S S F S O O X O S O S O O O O F S O O O O O O O S X
[79] S O O S X X X F O X X O O R O X O O O X X S
Levels: F R O S X
Upvotes: 1