Remi.b
Remi.b

Reputation: 18239

Squeeze extreme ranges in a data.frame

I have a data.frame which contain 3 columns named start, end and width. Each line represent a segment over a 1D space with a start, and end and a width such as the "width = end - start + 1"

Here is an example

d = data.frame(
start = c(12, 50, 100, 130, 190),
end   = c(16, 80, 102, 142, 201)
)
d$width = d$end - d$start + 1
print(d)
  start end width
1    12  16     5
2    50  80    31
3   100 102     3
4   130 142    13
5   190 201    12

Consider two breakpoints and a factor of division

UpperPos = 112
LowerPos = 61
factor   = 2

I would like to reduce the width of each segment outside the two breakpoints so that to reduce their width by a factor of factor. If a segment overlaps a breakpoint, then only the part of the segment that is outside this breakpoint should be reduced in width. In addition, the width of each segment must be a multiple of 3 and must be of non-zero length.

Here is my current function that "squeeze" the segments

squeeze = function(d, factor, LowerPos, UpperPos)
{
    for (row in 1:nrow(d))
    {
        if (d[row,]$end <= LowerPos | d[row,]$end >= UpperPos) # Complete squeeze
        {
            middlePos     = round(d[row,]$start + d[row,]$width/2)
            d[row,]$width = round(d[row,]$width / factor)
            d[row,]$width = d[row,]$width - d[row,]$width %% 3 + 3
            d[row,]$start = round(middlePos - d[row,]$width/2)
            d[row,]$end   = d[row,]$start + d[row,]$width -1
        } else if (d[row,]$start <= LowerPos & d[row,]$end >= LowerPos)  # Partial squeeze (Lower)
        {
            d[row,]$start = round(LowerPos - (LowerPos - d[row,]$start)/factor)
            d[row,]$width = d[row,]$end - d[row,]$start + 1
            if (d[row,]$width %% 3 != 0)
            {
                add = 3 - d[row,]$width %% 3
                d[row,]$width = d[row,]$width + add
                d[row,]$start = d[row,]$start - add
            }
        } else if (d[row,]$start >= UpperPos & d[row,]$end <= UpperPos) # Partial squeeze (Upper)
        {
            d[row,]$end     = round(UpperPos + (d[row,]$end - UpperPos)/factor)
            d[row,]$width = d[row,]$end - d[row,]$start + 1
            if (d[row,]$width %% 3 != 0)
            {
                add                     = 3 - d[row,]$width %% 3
                d[row,]$width = d[row,]$width + add
                d[row,]$end   = d[row,]$start + add
            }
        } else if (!(d[row,]$end < UpperPos & d[row,]$start > LowerPos) ) 
        {
            print(d)
            print(paste("row is ",row))
            print(paste("LowerPos is ",LowerPos))
            print(paste("UpperPos is ",UpperPos))
            stop("In MyRanges_squeeze: Should not run this line!")
        }
    }
    return(d)
}

and it returns the expected output

squeeze(d)
  start end width
1    12  14     3
2    54  80    27
3   100 102     3
4   132 140     9
5   192 200     9

However, my function squeeze is way too slow. Can you help me to improve it?

Upvotes: 0

Views: 250

Answers (1)

aichao
aichao

Reputation: 7455

Note that this answer only addresses how one may speed up your function, which is what you asked in your question, and not the validity of your logic with respect to your requirements.

As far as I can tell, all of your operations use vectorized operators. So, there is no need to loop over rows in squeeze. In the following, I have encapsulated all of your code that is within the if-else blocks as separate vectorized functions:

## This computes the case where d$end <= LowerPos | d$end >= UpperPos
f1 <- function(d, factor) {
  middlePos = round(d$start + d$width/2)
  d$width = round(d$width / factor)
  d$width = d$width - d$width %% 3 + 3
  d$start = round(middlePos - d$width/2)
  d$end   = d$start + d$width -1
  d
}

## This is used below in f2
f4 <- function(d) {
  add = 3 - d$width %% 3
  d$width = d$width + add
  d$start = d$start - add
  d
}

## This computes the case where d$start <= LowerPos & d$end >= LowerPos
f2 <- function(d, factor, LowerPos) {
  d$start = round(LowerPos - (LowerPos - d$start)/factor)
  d$width = d$end - d$start + 1
  ifelse(d$width %% 3 != 0, f4(d), d)
}

## This is used below in f3    
f5 <- function(d) {
  add     = 3 - d$width %% 3
  d$width = d$width + add
  d$end   = d$start + add
  d
}

## This computes the case where d$start >= UpperPos & d$end <= UpperPos
f3 <- function(d, factor, UpperPos) {
  d$end   = round(UpperPos + (d$end - UpperPos)/factor)
  d$width = d$end - d$start + 1
  ifelse (d$width %% 3 != 0, f5(d), d)
}

Now, in squeeze, we use f1, f2, and f3 to compute the squeeze for all three cases separately. We also include the case for no squeeze as just d. We then rbind them to one big data frame, dd. Now, all we need is to pick the correct row from each block of rows (each of size nrow(d)) in dd based on the case for that row. For this, we compute a ind for the case (i.e., 1 to 4) using a series of ifelse's. The value of ind is the block to chose from, and its position is the row from that block to choose from. We use this to subset dd to get the output.

squeeze <- function(d, factor, LowerPos, UpperPos) {
  d1 <- f1(d, factor)
  d2 <- f2(d, factor, LowerPos)
  d3 <- f3(d, factor, UpperPos)
  dd <- do.call(rbind,list(d1,d2,d3,d))
  ind <- ifelse(d$end <= LowerPos | d$end >= UpperPos, 1,
                 ifelse(d$start <= LowerPos & d$end >= LowerPos, 2,
                        ifelse(d$start >= UpperPos & d$end <= UpperPos, 3, 4)))
  dd[(ind-1) * nrow(d) + 1:nrow(d),]
}

Using this version, the result is the same as yours:

out <- squeeze(d, factor, LowerPos, UpperPos)
##   start end width
##1     12  14     3
##7     54  80    27
##18   100 102     3
##4    132 140     9
##5    192 200     9

Upvotes: 1

Related Questions