itpetersen
itpetersen

Reputation: 1515

Merge overlapping intervals in R

I’m trying to merge overlapping intervals to calculate a sum of unique intervals while removing excluded intervals.

Here's a minimal working example:

mydata <- data.frame(interval = c(1,2,3,4,5,6,7,8,9,10),
                     timeoutStart = c(280,500,NA,NA,NA,NA,NA,NA,NA,NA),
                     timeoutEnd = c(310,530,NA,NA,NA,NA,NA,NA,NA,NA),
                     cheeringStart = c(1,181,205,330,460,740,NA,NA,NA,NA),
                     cheeringEnd = c(120,199,300,420,475,760,NA,NA,NA,NA),
                     possessionStart = c(80,180,210,250,350,450,550,650,750,800),
                     possessionEnd = c(130,200,220,280,400,499,600,700,800,950)
)

interval timeoutStart timeoutEnd cheeringStart cheeringEnd possessionStart possessionEnd
       1          280        310             1         120              80           130
       2          500        530           181         199             180           200
       3           NA         NA           205         300             210           220
       4           NA         NA           330         420             250           280
       5           NA         NA           460         475             350           400
       6           NA         NA           740         760             450           499
       7           NA         NA            NA          NA             550           600
       8           NA         NA            NA          NA             650           700
       9           NA         NA            NA          NA             750           800
      10           NA         NA            NA          NA             800           950

In the minimal working example above, I’d like to calculate the total time the team spends cheering or has possession of the ball (excluding timeouts). The values in the matrix represent the start and end times (seconds elapsed since the start of the game) of different intervals for each outcome (timeout, cheering, or possession). The outcomes are not mutually exclusive and can co-occur. However, I don’t want to “double count” the overlapping intervals of cheering and possession. That is, I want to merge the overlapping intervals of cheering and possession, so I can sum the “unique” intervals.

For example, one cheering interval takes place from 740 to 760 seconds, whereas a possession interval overlaps with that interval (750 to 800 seconds). The merged interval would be 740 to 800 seconds (duration = 60 seconds).

After merging the overlapping intervals for cheering and possession, I want to exclude timeout intervals. For instance, for the unique interval from 205 to 300 seconds, I want to exclude the timeout interval from 280 to 310 seconds. So the unique interval excluding the timeout interval would be 205 to 280 seconds (duration = 75 seconds).

I want to calculate the duration of each unique interval (EndStart) excluding the timeout intervals, and then calculate the sum of all of these unique interval durations (excluding the timeout intervals). Lastly, I'd like to be able to include or exclude intervals from the calculation based on the value of another variable (keep = 0 or 1) in that row.

Let’s assume that that the Start and End time columns are not pre-sorted. I would also like the approach to be generalizable to be able to easily add multiple additional column sets to be included in the sum (e.g., dribbling, passing, etc.). I've looked at other answers but haven't found a way to generalize their solutions to my situation.

Upvotes: 0

Views: 1038

Answers (2)

tofd
tofd

Reputation: 620

How about this?

mydata <- data.frame(interval = c(1,2,3,4,5,6,7,8,9,10),
                     timeoutStart = c(280,500,NA,NA,NA,NA,NA,NA,NA,NA),
                     timeoutEnd = c(310,530,NA,NA,NA,NA,NA,NA,NA,NA),
                     cheeringStart = c(1,181,205,330,460,740,NA,NA,NA,NA),
                     cheeringEnd = c(120,199,300,420,475,760,NA,NA,NA,NA),
                     possessionStart = c(80,180,210,250,350,450,550,650,750,800),
                     possessionEnd = c(130,200,220,280,400,499,600,700,800,950),
                     keep = c(rep(FALSE, 2), rep(TRUE, 8)) #added for illustration
)

#add whatever columns you want to use to calculate the merged interval
#they must be in the same order in both vectors
#e.g. if 'cheeringStart' is at index 1, so must 'cheeringEnd'
intervalStartCols <- c('cheeringStart', 'possessionStart')
intervalEndCols <- c('cheeringEnd', 'possessionEnd')
intervalCols <- c(intervalStartCols, intervalEndCols)
timeoutCols <- c('timeoutStart', 'timeoutEnd')

mydata$mergedDuration <- apply(mydata, MARGIN = 1, FUN = function(row){

  #return zero if all NAs
  if(all(is.na(row[intervalCols]))) return(0)

  if(!all(is.na(row[timeoutCols]))){
    timeout.start <- row['timeoutStart']
    timeout.end <- row['timeoutEnd']
  } else {
    timeout.end <- 0
  }

  #identify the maximum time (this will be the end of the merged interval)
  max.end <- max(row[intervalEndCols], na.rm=TRUE)

  #set intial values
  duration <- 0
  segment.complete <- FALSE
  start.i <- which(row[intervalStartCols] == min(row[intervalStartCols], na.rm=TRUE))
  next.step <- row[intervalStartCols][start.i]

  waypoints <- row[intervalCols]
  waypoints <- waypoints[!is.na(waypoints)]
  waypoints <- waypoints[waypoints!=next.step]

  #calculate interval duration adjusting for overlap
  while(next.step < max.end){

    start <- row[intervalStartCols][start.i]

    next.step <- waypoints[waypoints == min(waypoints[waypoints!=next.step])]
    if(segment.complete){
      start.i <- which(row[intervalStartCols] == next.step)
      segment.complete <- FALSE
    }
    end.i <- which(row[intervalEndCols] == next.step)

    waypoints <- waypoints[waypoints!=next.step]

    if(length(end.i) > 0 && length(start.i) >0 && end.i == start.i) {

      segment.start <- row[intervalStartCols][start.i]
      segment.end <- row[intervalEndCols][end.i]
      segment.duration <- segment.end - segment.start

      #adjust for timeout
      timeout.adj <- {
        if (timeout.end == 0) 0 #this is the NA case
        else if(timeout.start > segment.end | timeout.end < segment.start) 0
        else if(timeout.end > segment.end & timeout.start < segment.start) segment.duration
        else if(timeout.end < segment.end) timeout.end - segment.start
        else segment.end - timeout.start
      }

      duration <- duration + segment.duration - timeout.adj
      segment.complete <- TRUE
    }

  }

  duration
})

#sum duration using 'keep' column as mask
summed.duration <- sum(mydata[mydata$keep, 'mergedDuration'])
print(summed.duration)

Upvotes: 1

Wimpel
Wimpel

Reputation: 27802

here is a solutions using data.table's foverlaps(), to perform an overlap-join. It's only a partly solution... providing a desired output would help. But you can probably build on this code to get whataver you want..

assuming your data is named df

library( data.table )

#create data.tables for cheers and possession
cheers.dt <- data.table( interval.cheer = df$interval, 
                     start.cheer = df$cheeringStart, 
                     end.cheer = df$cheeringEnd )[!is.na(start.cheer),]
possession.dt <- data.table( interval.pos = df$interval, 
                             start.pos = df$possessionStart, 
                             end.pos = df$possessionEnd )
#set keys
setkey( cheers.dt, start.cheer, end.cheer )
#perform overlap-join
foverlaps( possession.dt, 
           cheers.dt, 
           by.x = c( "start.pos", "end.pos" ), 
           type = "any", 
           mult = "all", 
           nomatch = NULL )

#    interval.cheer start.cheer end.cheer interval.pos start.pos end.pos
# 1:              1           1       120            1        80     130
# 2:              2         181       199            2       180     200
# 3:              3         205       300            3       210     220
# 4:              3         205       300            4       250     280
# 5:              4         330       420            5       350     400
# 6:              5         460       475            6       450     499
# 7:              6         740       760            9       750     800

I advise you to read about data.table's foverlaps()-function, and non-equi joins.

Upvotes: 2

Related Questions