yasel

Reputation: 453

Conditional manipulation and extension of rows in data.table also considering previous extensions without for-loop

Suppose I have two data.tables:

A <- data.table(
  idx = c(1,2,3),
  leftbound = c(1,134,1546),
  rightbound = c(65, 180, 1670),
  infA = c("infA1", "infA2", "infA3")
)

A
   idx leftbound rightbound  infA
1:   1         1         65 infA1
2:   2       134        180 infA2
3:   3      1546       1670 infA3




B <- data.table(
  breakpoint = c(150, 165, 1555),
  infB = c("infB1", "infB2", "infB3")
)

B

   breakpoint  infB
1:        150 infB1
2:        165 infB2
3:       1555 infB3

In data.table A, each row corresponds to a range from a left to a right boundary. It has an index column (idx), a left and a right boundary column (leftbound and rightbound), and an additional variable (infA). Data.table B contains points that should be inserted as breakpoints into the ranges of the first table. For example, the range in row 2 (134 to 180) should be split at 150 and 165, giving three ranges: 134 - 150, 150 - 165 and 165 - 180. Each of these three ranges should become a new row, replacing the old "unsplit" one.
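To make the splitting rule concrete, here is a minimal base-R sketch for a single range. The helper name split_range is hypothetical (it is not part of data.table), and it assumes that only breakpoints strictly inside the range should split it.

```r
# Hypothetical helper: split one range [lb, ub] at the breakpoints
# that fall strictly inside it (assumption: boundary hits do not split).
split_range <- function(lb, ub, breaks) {
  inside <- sort(breaks[breaks > lb & breaks < ub])
  v <- c(lb, inside, ub)
  # Consecutive pairs of v are the new sub-ranges.
  data.frame(lb = head(v, -1), ub = tail(v, -1))
}

# Row 2 of A, with all breakpoints from B:
split_range(134, 180, c(150, 165, 1555))
```

For row 2 this gives the three sub-ranges 134 - 150, 150 - 165 and 165 - 180; a range with no breakpoint inside it comes back as a single unchanged row.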

The output should then look like:

Output
   peak.grp   lb   ub  infA  infB
1:        1    1   65 infA1 infB1
2:        2  134  150 infA2 infB2
3:        2  150  165 infA2 infB2
4:        2  165  180 infA2 infB2
5:        3 1546 1555 infA3 infB3
6:        3 1555 1670 infA3 infB3

Is there some way to achieve this without a for-loop?

Upvotes: 4

Views: 120

Answers (2)

Frank

Reputation: 66819

Same as @Alexis's answer, but vectorized instead of using lapply over the breakpoints:

res <- B[A, on=.(breakpoint >= leftbound, breakpoint <= rightbound), {
  v = c(i.leftbound, head(x.breakpoint, .N), i.rightbound)
  n = c(i.infA, head(x.infB, .N), i.infA)
  .(idx = idx, lb = head(v, -1), rb = tail(v, -1), ln = head(n, -1), rn = tail(n, -1))
}, by=.EACHI][, (1:2) := NULL][]

   idx   lb   rb    ln    rn
1:   1    1   65 infA1 infA1
2:   2  134  150 infA2 infB1
3:   2  150  165 infB1 infB2
4:   2  165  180 infB2 infA2
5:   3 1546 1555 infA3 infB3
6:   3 1555 1670 infB3 infA3

I'm using head(var, .N) because, when no match is found, the variable is populated with NA while .N is still 0, so head(var, .N) has zero length and the NA is dropped. I think if (.N) var would also work, and might be more readable.
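The two idioms can be compared outside of a join with a small base-R sketch. The values below are made up for illustration: n stands in for .N and x for a join column that holds a single NA because nothing matched.

```r
n <- 0L          # stands in for .N when no breakpoint matched
x <- NA_real_    # stands in for x.breakpoint in that case

v1 <- c(1, head(x, n), 65)   # head(x, 0) has length zero
v2 <- c(1, if (n) x, 65)     # if (0L) ... yields NULL, which c() drops

identical(v1, v2)
```

Both vectors contain only the two bounds, so the unmatched row passes through as a single range.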

Related: https://github.com/Rdatatable/data.table/issues/3452

Upvotes: 3

Alexis

Reputation: 5069

I don't really understand how the two infA columns are supposed to be filled, but perhaps this does what you want:

breaker <- function(peak.grp, lb, ub, breaks, infA, infB) {
  if (anyNA(breaks)) {
    data.frame(peak.grp = peak.grp,
               lb = lb,
               ub = ub,
               leftinf = infA,
               rightinf = infA,
               stringsAsFactors = FALSE)
  }
  else {
    breakpoints <- c(lb, breaks, ub)
    inf <- c(infA, infB, infA)

    dfs <- lapply(seq_along(breakpoints)[-length(breakpoints)], function(i) {
      data.frame(lb = breakpoints[i],
                 ub = breakpoints[i + 1L],
                 leftinf = inf[i],
                 rightinf = inf[i + 1L],
                 stringsAsFactors = FALSE)
    })

    data.frame(peak.grp = peak.grp, do.call(rbind, dfs, TRUE))
  }
}

B[A,
  breaker(idx, leftbound, rightbound, x.breakpoint, infA, infB),
  on = .(breakpoint > leftbound, breakpoint < rightbound),
  by = .EACHI
  ][, -(1:2)]
   peak.grp   lb   ub leftinf rightinf
1:        1    1   65   infA1    infA1
2:        2  134  150   infA2    infB1
3:        2  150  165   infB1    infB2
4:        2  165  180   infB2    infA2
5:        3 1546 1555   infA3    infB3
6:        3 1555 1670   infB3    infA3

The command at the end performs a non-equi join to find all breakpoints that lie within the bounds from A. It specifies by = .EACHI so that, for each row of A, the matching rows from B are passed to the helper as one group. The first 2 columns are then discarded because by = .EACHI adds them automatically, one for each condition in on.
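To see what by = .EACHI does here in isolation, a small sketch is to count the matched breakpoints per row of A instead of calling the helper. It rebuilds A and B from the question and assumes the data.table package is installed; cnt is just an illustrative name.

```r
library(data.table)

A <- data.table(
  idx = c(1, 2, 3),
  leftbound = c(1, 134, 1546),
  rightbound = c(65, 180, 1670),
  infA = c("infA1", "infA2", "infA3")
)
B <- data.table(
  breakpoint = c(150, 165, 1555),
  infB = c("infB1", "infB2", "infB3")
)

# One output row per row of A; .N is the number of breakpoints
# strictly inside that row's bounds (0 when nothing matches).
cnt <- B[A, .N, on = .(breakpoint > leftbound, breakpoint < rightbound), by = .EACHI]
cnt$N
```

The result also carries one leading column per condition in on, which is why the answer drops the first 2 columns afterwards.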

The helper function handles two cases. If any breakpoint is NA, no row from B lies within the bounds from A, so it simply returns the input A row as output. Otherwise, it creates the new ranges by concatenating the lower bound, the breakpoints, and the upper bound, and then takes each consecutive pair inside the lapply call. It does something similar for inf; you can adjust that part if it's not what you want.

Upvotes: 2
