Reputation: 715
I have a data table that defines the start and end coordinates for a set of sequences. For example:
df1 <- data.frame(from = c(7, 22, 35, 21, 50),
                  to   = c(13, 29, 43, 31, 60))
Given overall start and end coordinates (i.e. 1 and 100), I am trying to identify all integers not covered by the sequences, with the output in the same format. For example:
df2 <- data.frame(from = c(1, 14, 32, 44, 61),
                  to   = c(6, 20, 34, 49, 100))
Here is my current attempt, in which I vectorise the sequences in df1 and then identify all integers in 1:100 that are not covered by any of them.
# vectorise seq() over the from/to pairs
seq2 <- Vectorize(seq.default, vectorize.args = c("from", "to"))
seq <- 1:100
# expand every sequence and keep the distinct covered integers
df1_int <- unlist(seq2(from = df1$from, to = df1$to))
df1_int <- unique(df1_int)
# integers in 1:100 not covered by any sequence
df2_int <- seq[!seq %in% df1_int]
# check whether the uncovered integers form a single consecutive run
all(diff(df2_int) == 1)
However, this method is too slow for the dataset I want to apply it to (~100,000,000 integers), and I do not know how to reformat the vector df2_int into a data frame in the format of df2.
Any help will be greatly appreciated!
NB: The sequences in df1 do not always start with the lowest integer (e.g. a sequence could run from 13 to 7, as opposed to from 7 to 13). There could also be sequences containing only one integer (e.g. from 7 to 7).
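For what it's worth, seq() itself handles both of those cases, so expanding the ranges is not the issue:
seq(13, 7)  # counts down: 13 12 11 10  9  8  7
seq(7, 7)   # single value: 7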
Upvotes: 6
Views: 107
Reputation: 25225
Borrowing the idea from How to flatten / merge overlapping time periods, but in a data.table approach instead:
library(data.table)
setDT(df1)
setorder(df1, from, to)
maxn <- 100L
# assign a group id to each cluster of overlapping ranges (see linked post)
df1[, g := c(0, cumsum(shift(from, -1L) > cummax(to))[-.N])]
# collapse each group, then shift the ends to turn covered ranges into gaps
df1[, .(from = max(to) + 1L, to = min(from) - 1L), by = .(g)][,
    .(from = c(1L, from), to = c(to, maxn))]
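For reference, the intermediate grouping variable g on the (sorted) example data comes out as follows, with overlapping ranges sharing a group:
df1
#    from to g
# 1:    7 13 0
# 2:   21 31 1
# 3:   22 29 1
# 4:   35 43 2
# 5:   50 60 3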
Hopefully, this is fast enough for your actual dataset with 100 million integers.
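Note that the question allows reversed ranges (e.g. from 13 to 7). Since this approach relies on ordered from/to pairs, such rows would need swapping first, for example with a minimal pre-processing step like:
# swap any reversed pairs so that from <= to, before setorder()
df1[, c("from", "to") := .(pmin(from, to), pmax(from, to))]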
Upvotes: 1
Reputation: 73242
Since you need a fast solution, we could attempt a base R approach using setdiff and split. The vectorization we leave to mapply. To find the factors at which to split we use findInterval, and to get the start and end points of each resulting chunk we finish with range.
# uncovered integers
d <- setdiff(1:100, unlist(mapply(seq.default, df1[, 1], df1[, 2])))
# split d wherever the jump exceeds 1, then take each chunk's range
t(sapply(split(d, findInterval(d, d[which(c(1, diff(d)) > 1)])), range))
#   [,1] [,2]
# 0    1    6
# 1   14   20
# 2   32   34
# 3   44   49
# 4   61  100
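To see why the grouping works, here is a breakdown on the example data (breaks is an illustrative name; intermediate values shown as comments):
d <- setdiff(1:100, unlist(mapply(seq.default, df1[, 1], df1[, 2])))
# first element of every run after the first, i.e. where the jump exceeds 1
breaks <- d[which(c(1, diff(d)) > 1)]
breaks
# [1] 14 32 44 61
# findInterval() labels each element of d with the run it falls into
head(findInterval(d, breaks), 10)
# [1] 0 0 0 0 0 0 1 1 1 1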
Benchmark
As we can see from the benchmark, we have achieved a pretty fast solution.
Unit: microseconds
         expr      min        lq      mean    median       uq      max neval cld
        purrr 1575.479 1593.2110 1634.3573 1604.9475 1634.033 2028.095   100   b
 findInterval  250.801  256.9245  276.8609  273.3815  281.673  498.285   100  a
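For reference, a benchmark of this kind could be produced along these lines (a sketch: f_purrr and f_findInterval are hypothetical wrappers around the purrr answer below and this solution):
library(microbenchmark)
microbenchmark(
  purrr        = f_purrr(df1),
  findInterval = f_findInterval(df1),
  times = 100L
)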
Upvotes: 2
Reputation: 28695
Edit: I should have read the question better; this is basically your current approach.
You can pmap over your input with the seq function and unlist the result to get a vector of all covered values, then setdiff to get the missing values. Using diff and cumsum you can create a grouping variable for the missing values, grouping them into from-to pairs. Then split the missing-value vector by the grouping variable and map over that to create one row of output per group.
library(purrr)

# all values in 1:100 not covered by any row of df1
miss <- setdiff(1:100, unlist(pmap(df1, seq)))

# grouping variable; it numbers the runs from last to first,
# hence the dplyr::arrange() at the end to restore ascending order
i <- miss %>%
  diff %>%
  `>`(1) %>%
  rev %>%
  cumsum %>%
  rev

map_df(split(miss, c(i, 0)), ~list(from = head(.x, 1), to = tail(.x, 1))) %>%
  dplyr::arrange(from)
# # A tibble: 5 x 2
#    from    to
#   <int> <int>
# 1     1     6
# 2    14    20
# 3    32    34
# 4    44    49
# 5    61   100
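As an aside, the grouping variable can also be built counting forward, which avoids the rev/cumsum/rev round trip and the final arrange(); a sketch on the same example data:
# number the runs from first to last; split() then keeps them in order
g <- cumsum(c(1, diff(miss) > 1))
map_df(split(miss, g), ~list(from = head(.x, 1), to = tail(.x, 1)))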
Upvotes: 2