Suppose I have two data.tables:
A <- data.table(
  idx = c(1, 2, 3),
  leftbound = c(1, 134, 1546),
  rightbound = c(65, 180, 1670),
  infA = c("infA1", "infA2", "infA3")
)
A
idx leftbound rightbound infA
1: 1 1 65 infA1
2: 2 134 180 infA2
3: 3 1546 1670 infA3
B <- data.table(
  breakpoint = c(150, 165, 1555),
  infB = c("infB1", "infB2", "infB3")
)
B
breakpoint infB
1: 150 infB1
2: 165 infB2
3: 1555 infB3
In data.table A each row corresponds to a range from a left to a right boundary. It has an index column (idx), a left and a right boundary column (leftbound and rightbound), and an additional variable (infA).
Data.table B contains points that should be inserted as breakpoints into the ranges of the first table. For example, the range in row 2, from 134 to 180, should be split at 150 and 165, giving three ranges: 134 to 150, 150 to 165, and 165 to 180. Each of these three ranges should become a new row that replaces the original unsplit one.
Hence the output should look like:
Output
peak.grp lb ub infA infB
1: 1 1 65 infA1 infB1
2: 2 134 150 infA2 infB2
3: 2 150 165 infA2 infB2
4: 2 165 180 infA2 infB2
5: 3 1546 1555 infA3 infB3
6: 3 1555 1670 infA3 infB3
Is there some way to achieve this without a for-loop?
Upvotes: 4
Same as @Alexis but vectorized instead of using lapply over the breakpoints:
res <- B[A, on=.(breakpoint >= leftbound, breakpoint <= rightbound), {
  # bounds and labels of this row of A, with any matched breakpoints in between
  v = c(i.leftbound, head(x.breakpoint, .N), i.rightbound)
  n = c(i.infA, head(x.infB, .N), i.infA)
  # consecutive pairs of v (and of n) give the new ranges
  .(idx = idx, lb = head(v, -1), rb = tail(v, -1), ln = head(n, -1), rn = tail(n, -1))
}, by=.EACHI][, (1:2) := NULL][]
idx lb rb ln rn
1: 1 1 65 infA1 infA1
2: 2 134 150 infA2 infB1
3: 2 150 165 infB1 infB2
4: 2 165 180 infB2 infA2
5: 3 1546 1555 infA3 infB3
6: 3 1555 1670 infB3 infA3
I'm using head(var, .N) in case the variable is populated with NA because no match is found (we'll still have .N == 0, so head(var, .N) will have zero length). I think if (.N) var would also work, and may be more readable.
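As a small illustration of the point above, here is a base R sketch of what both idioms do when nothing matched. The names v, n, with_head and with_if are made up for this example; v stands in for a join column filled with NA and n stands in for .N in an unmatched group.

```r
v <- NA_real_   # stand-in for a join column when no row matched
n <- 0L         # stand-in for .N in an unmatched group

# head(v, 0) has zero length, so c() adds nothing between the bounds
with_head <- c(1, head(v, n), 65)

# if (0L) ... is not evaluated and yields NULL, which c() also drops
with_if <- c(1, if (n) v, 65)

print(with_head)
print(with_if)
```

In both cases only the two bounds remain, so the unmatched row of A passes through as a single range.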
Related: https://github.com/Rdatatable/data.table/issues/3452
Upvotes: 3
I don't really understand how the two infA columns are supposed to be filled, but perhaps this does what you want:
breaker <- function(peak.grp, lb, ub, breaks, infA, infB) {
  if (anyNA(breaks)) {
    # no breakpoint inside this range: return the row of A unchanged
    data.frame(peak.grp = peak.grp,
               lb = lb,
               ub = ub,
               leftinf = infA,
               rightinf = infA,
               stringsAsFactors = FALSE)
  }
  else {
    breakpoints <- c(lb, breaks, ub)
    inf <- c(infA, infB, infA)
    # one output row per consecutive pair of breakpoints
    dfs <- lapply(seq_along(breakpoints)[-length(breakpoints)], function(i) {
      data.frame(lb = breakpoints[i],
                 ub = breakpoints[i + 1L],
                 leftinf = inf[i],
                 rightinf = inf[i + 1L],
                 stringsAsFactors = FALSE)
    })
    data.frame(peak.grp = peak.grp, do.call(rbind, dfs, TRUE))
  }
}
B[A,
  breaker(idx, leftbound, rightbound, x.breakpoint, infA, infB),
  on = .(breakpoint > leftbound, breakpoint < rightbound),
  by = .EACHI
][, -(1:2)]
peak.grp lb ub leftinf rightinf
1: 1 1 65 infA1 infA1
2: 2 134 150 infA2 infB1
3: 2 150 165 infB1 infB2
4: 2 165 180 infB2 infA2
5: 3 1546 1555 infA3 infB3
6: 3 1555 1670 infB3 infA3
The command at the end performs a non-equi join to find all breakpoints that lie within the bounds from A, and specifies by = .EACHI so that the helper is called once per row of A with the matching rows from B. The first two columns are then discarded, because by = .EACHI adds them automatically, one for each condition in on.
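To see the grouping that by = .EACHI produces, here is a minimal sketch that only counts the matches per row of A instead of calling the helper. It rebuilds the question's A and B so it runs on its own; the name counts is just for illustration.

```r
library(data.table)

A <- data.table(idx = c(1, 2, 3),
                leftbound = c(1, 134, 1546),
                rightbound = c(65, 180, 1670),
                infA = c("infA1", "infA2", "infA3"))
B <- data.table(breakpoint = c(150, 165, 1555),
                infB = c("infB1", "infB2", "infB3"))

# j is evaluated once per row of A; .N is the number of rows of B
# that lie strictly inside that row's bounds (0 when nothing matches)
counts <- B[A, on = .(breakpoint > leftbound, breakpoint < rightbound),
            .N, by = .EACHI]
print(counts$N)
```

With the example data this should give 0, 2 and 1 matches for the three rows of A, which is why the first range stays whole and the second is split into three.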
The helper function handles two cases.
If any breakpoint is NA, no row from B lies within the bounds from A, so it simply returns the input A row as output.
Otherwise, it creates the new ranges by concatenating the lower bound, the breakpoints, and the upper bound, and then takes each consecutive pair inside the lapply call.
It does something similar for inf; maybe you can adjust that if it's not what you want.
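The consecutive-pair step can also be written without lapply. This is only a sketch in base R; split_range is a hypothetical helper, not part of the answer above, and it assumes the breakpoints are already sorted and lie inside the bounds.

```r
# Hypothetical helper: turn one range plus its breakpoints into sub-ranges
split_range <- function(lb, ub, breaks) {
  v <- c(lb, breaks, ub)
  # dropping the last element gives the lower bounds,
  # dropping the first gives the matching upper bounds
  data.frame(lb = head(v, -1), ub = tail(v, -1))
}

res <- split_range(134, 180, c(150, 165))
print(res)

# a range without breakpoints comes back as a single row
whole <- split_range(1, 65, numeric(0))
print(whole)
```

Passing a zero-length breaks vector covers the no-match case, so a separate NA branch would not be needed in this variant.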
Upvotes: 2