Reputation: 6200

Eliminating rows from a data.frame

I have this example data.frame:

df <- data.frame(id=c("a","a,b,c","d,e","d","h","e","i","b","c"), start=c(100,100,400,400,800,500,900,200,300), end=c(150,350,550,450,850,550,950,250,350), level = c(1,5,2,3,6,4,2,1,1))

> df
     id start end level
1     a   100 150     1
2 a,b,c   100 350     5
3   d,e   400 550     2
4     d   400 450     3
5     h   800 850     6
6     e   500 550     4
7     i   900 950     2
8     b   200 250     1
9     c   300 350     1

where each row is a linear interval. As this example shows some rows are merged intervals (rows 2 and 3).

What I'd like to do is for each merged interval either eliminate all its individual parts from df if the df$level of the merged interval is greater than that of all its parts, or if the df$level of the merged interval is smaller than at least one of its parts eliminate the merged interval.

So for this example, the output should be:

> res.df
     id start end level
1 a,b,c   100 350     5
2     d   400 450     3
3     h   800 850     6
4     e   500 550     4
5     i   900 950     2

Upvotes: 3

Answers (2)

MrFlick

Reputation: 206232

Method 1 (ID values)

So If we can assume that all the "merged" group have ID names that are a comma separated list of the individual groups, then we can tackle this problem just looking at the IDs and ignore the start/end information. Here is one such method

First, find all the "merged" groups by finding the IDs with commas

groups<-Filter(function(x) length(x)>1, 
    setNames(strsplit(as.character(df$id),","),df$id))

Now, for each of those groups, determine who has the larger level, either the merged group or one of the individual groups. Then return the index of the rows to drop as a negative number

drops<-unlist(lapply(names(groups), function(g) {
    mi<-which(df$id==g)
    ii<-which(df$id %in% groups[[g]])
    if(df[mi, "level"] > max(df[ii, "level"])) {
        return(-ii)
    } else {
        return(-mi)
    }
}))

And finally, drop those from the data.frame

df[drops,]

#      id start end level
# 2 a,b,c   100 350     5
# 4     d   400 450     3
# 5     h   800 850     6
# 6     e   500 550     4
# 7     i   900 950     2

Method 2 (Start/End Graph)

I wanted to also try a method that ignored the (very useful) merged ID names and just looked at the start/end positions. I may have gone off in a bad direction but this lead me to think of it as a network/graph type problem so I used the igraph library.

I created a graph where each vertex represented a start/end position. Each edge therefore represented a range. I used all the ranges from the sample data set and filled in any missing ranges to make the graph connected. I merged that data together to create an edge list. For each edge, I remember the "level" and "id" values from the original data set. Here's the code to do that

library(igraph)

poslist<-sort(unique(c(df$start, df$end)))
seq.el<-embed(rev(poslist),2)
class(seq.el)<-"character"
colnames(seq.el)<-c("start","end")

el<-rbind(df[,c("start","end","level", "id")],data.frame(seq.el, level=0, id=""))
el<-el[!duplicated(el[,1:2]),]

gg<-graph.data.frame(el)

And that creates a graph that looks like

start graph

So basically we want to eliminate cycles in the graph by taking the path with the edge that has the maximum "level" value. Unfortunately since this isn't a normal path-weighting scheme, I didn't find an easy way to do this with a default algorithm (maybe I missed it). So I had to write my own graph transversal function. It's not as pretty as I would have liked, but here it is.

findPath <- function(gg, fromv, tov) {
    if ((missing(tov) && length(incident(gg, fromv, "in"))>1) || 
        (!missing(tov) && V(gg)[fromv]==V(gg)[tov])) {
        return (list(level=0, path=numeric()))
    }
    es <- E(gg)[from(fromv)]
    if (length(es)>1) {
        pp <- lapply(get.edges(gg, es)[,2], function(v) {
            edg <- E(gg)[fromv %--% v]
            lvl <- edg$level
            nxt <- findPaths(gg,v)
            return (list(level=max(lvl, nxt$level), path=c(edg,nxt$path)))
        })
        lvl <- sapply(pp, `[[`, "level")
        take <- pp[[which.max(lvl)]]
        nxt <- findPaths(gg, get.edges(gg, tail(take$path,1))[,2], tov)
        return (list(level=max(take$level, nxt$level), path=c(take$path, nxt$path)))
    } else  {
        lvl <- E(gg)[es]$level
        nv <- get.edges(gg,es)[,2]
        nxt <- findPaths(gg, nv, tov)
        return (list(level=max(lvl, nxt$level), path=c(es, nxt$path)))
    }
}

This will find a path between two nodes that satisfies the property of having a maximal level when presented with a branch. We call that with this data set with

rr <- findPaths(gg, "100","950")$path

This will find the final path. Since each row in the original df data.frame is represented by an edge, we just need to extract the edges from the path that correspond to the final path. This actually gives us a path that looks like

final path

where the red path is the chosen one. I can then subset df with

df[df$id %in% na.omit(E(gg)[rr]$id), ]

#      id start end level
# 2 a,b,c   100 350     5
# 4     d   400 450     3
# 5     h   800 850     6
# 6     e   500 550     4
# 7     i   900 950     2

Method 3 (Overlap Matrix)

He's another way to look at the start/stop positions. I create a matix where columns correspond to ranges in the rows of the data.frame and the rows of the matrix correspond to positions. Each value in the matrix is true if a range overlaps a position. Here I use the between.R helper function

#find unique positions and create overlap matrix
un<-sort(unique(unlist(df[,2:3])))    
cc<-sapply(1:nrow(df), function(i) between(un, df$start[i], df$end[i]))

#partition into non-overlapping sections
groups<-cumsum(c(F,rowSums(cc[-1,]& cc[-nrow(cc),])==0))

#find the IDs to keep from each section
keeps<-lapply(split.data.frame(cc, groups), function(m) {
    lengths <- colSums(m)
    mx <- which.max(lengths)
    gx <- setdiff(which(lengths>0), mx)
    if(length(gx)>0) {
        if(df$level[mx] > max(df$level[gx])) {
            mx
        } else {
            gx
        }
    } else {
        mx
    }
})

This will give a list of the IDs to keep from each group, and we can get the final data.set with

df[unlist(keeps),]

Method 4 (Open/Close Listing)

I have one last method. This one might be the most scalable. We basically melt the positions and keep track of opening and closing events to identify the groups. Then we split and see if the longest in each group has the max level or not. Ultimately we return the IDs. This method uses all standard base functions.

#create open/close listing
dd<-rbind(
    cbind(df[,c(1,4)],pos=df[,2], evt=1),
    cbind(df[,c(1,4)],pos=df[,3], evt=-1)
)

#annotate with useful info
dd<-dd[order(dd$pos, -dd$evt),]
dd$open <- cumsum(dd$evt)
dd$group <- cumsum(c(0,head(dd$open,-1)==0))
dd$width <- ave(dd$pos, dd$id, FUN=function(x) diff(range(x)))

#slim down
dd <- subset(dd, evt==1,select=c("id","level","width","group"))

#process each group
ids<-unlist(lapply(split(dd, dd$group), function(x) {
    if(nrow(x)==1) return(x$id)
    mw<-which.max(x$width)
    ml<-which.max(x$level)
    if(mw==ml) {
        return(x$id[mw])
    } else {
        return(x$id[-mw])
    }
}))

and finally subset

df[df$id %in% ids, ]

by now I think you know what this returns

Summary

So if your real data has the same type of IDs as the sample data, obviously method 1 is a better, more direct choice. I'm still hoping there is a way to simplify method 2 that i'm just missing. I've not done any testing on efficiency or performance of these methods. I'm guessing method 4 might be be the most efficient since it should scale linearly.

Upvotes: 5

Neal Fultz

Reputation: 9687

I'll take a procedural approach; basically, sort descending by level, and for each record, remove later records that have a matching id.

df <- data.frame(id=c("a","a,b,c","d,e","d","h","e","i","b","c"), start=c(100,100,400,400,800,500,900,200,300), end=c(150,350,550,450,850,550,950,250,350), 
                 level = c(1,5,2,3,6,4,2,1,1), stringsAsFactors=FALSE)

#sort
ids <- df[order(df$level, decreasing=TRUE), "id"]

#split
ids <- sapply(df$id, strsplit, ",")

i <- 1

while( i < length(ids)) {
  current <- ids[[i]]

  j <- i + 1
  while(j <= length(ids)) {
    if(any(ids[[j]] %in% current)) 
      ids[[j]] <- NULL
    else 
      j <- j + 1
  }
  i <- i + 1
}

And finally, only keep the ids that are left:

R> ids <- data.frame(id=names(ids), stringsAsFactors=FALSE)

R> merge(ids, df, sort=FALSE)
     id start end level
1     h   800 850     6
2 a,b,c   100 350     5
3     e   500 550     4
4     d   400 450     3
5     i   900 950     2

This has ugly while loops because R only has for-each loops, and also note the stringsAsFactors=FALSE is necessary for splitting the ids. Deleting middle elements could be bad for performance, but that will depend on the underlying implementation R uses for lists (linked vs arrays).

Upvotes: 1