Reputation: 6200
I have this example data.frame
:
df <- data.frame(id=c("a","a,b,c","d,e","d","h","e","i","b","c"), start=c(100,100,400,400,800,500,900,200,300), end=c(150,350,550,450,850,550,950,250,350), level = c(1,5,2,3,6,4,2,1,1))
> df
id start end level
1 a 100 150 1
2 a,b,c 100 350 5
3 d,e 400 550 2
4 d 400 450 3
5 h 800 850 6
6 e 500 550 4
7 i 900 950 2
8 b 200 250 1
9 c 300 350 1
where each row is a linear interval. As this example shows some rows are merged intervals (rows 2 and 3).
What I'd like to do is for each merged interval either eliminate all its individual parts from df
if the df$level
of the merged interval is greater than that of all its parts, or if the df$level
of the merged interval is smaller than at least one of its parts eliminate the merged interval.
So for this example, the output should be:
> res.df
id start end level
1 a,b,c 100 350 5
2 d 400 450 3
3 h 800 850 6
4 e 500 550 4
5 i 900 950 2
Upvotes: 3
Views: 157
Reputation: 206232
So If we can assume that all the "merged" group have ID names that are a comma separated list of the individual groups, then we can tackle this problem just looking at the IDs and ignore the start/end information. Here is one such method
First, find all the "merged" groups by finding the IDs with commas
groups<-Filter(function(x) length(x)>1,
setNames(strsplit(as.character(df$id),","),df$id))
Now, for each of those groups, determine who has the larger level, either the merged group or one of the individual groups. Then return the index of the rows to drop as a negative number
drops<-unlist(lapply(names(groups), function(g) {
mi<-which(df$id==g)
ii<-which(df$id %in% groups[[g]])
if(df[mi, "level"] > max(df[ii, "level"])) {
return(-ii)
} else {
return(-mi)
}
}))
And finally, drop those from the data.frame
df[drops,]
# id start end level
# 2 a,b,c 100 350 5
# 4 d 400 450 3
# 5 h 800 850 6
# 6 e 500 550 4
# 7 i 900 950 2
I wanted to also try a method that ignored the (very useful) merged ID names and just looked at the start/end positions. I may have gone off in a bad direction but this lead me to think of it as a network/graph type problem so I used the igraph
library.
I created a graph where each vertex represented a start/end position. Each edge therefore represented a range. I used all the ranges from the sample data set and filled in any missing ranges to make the graph connected. I merged that data together to create an edge list. For each edge, I remember the "level" and "id" values from the original data set. Here's the code to do that
library(igraph)
poslist<-sort(unique(c(df$start, df$end)))
seq.el<-embed(rev(poslist),2)
class(seq.el)<-"character"
colnames(seq.el)<-c("start","end")
el<-rbind(df[,c("start","end","level", "id")],data.frame(seq.el, level=0, id=""))
el<-el[!duplicated(el[,1:2]),]
gg<-graph.data.frame(el)
And that creates a graph that looks like
So basically we want to eliminate cycles in the graph by taking the path with the edge that has the maximum "level" value. Unfortunately since this isn't a normal path-weighting scheme, I didn't find an easy way to do this with a default algorithm (maybe I missed it). So I had to write my own graph transversal function. It's not as pretty as I would have liked, but here it is.
findPath <- function(gg, fromv, tov) {
if ((missing(tov) && length(incident(gg, fromv, "in"))>1) ||
(!missing(tov) && V(gg)[fromv]==V(gg)[tov])) {
return (list(level=0, path=numeric()))
}
es <- E(gg)[from(fromv)]
if (length(es)>1) {
pp <- lapply(get.edges(gg, es)[,2], function(v) {
edg <- E(gg)[fromv %--% v]
lvl <- edg$level
nxt <- findPaths(gg,v)
return (list(level=max(lvl, nxt$level), path=c(edg,nxt$path)))
})
lvl <- sapply(pp, `[[`, "level")
take <- pp[[which.max(lvl)]]
nxt <- findPaths(gg, get.edges(gg, tail(take$path,1))[,2], tov)
return (list(level=max(take$level, nxt$level), path=c(take$path, nxt$path)))
} else {
lvl <- E(gg)[es]$level
nv <- get.edges(gg,es)[,2]
nxt <- findPaths(gg, nv, tov)
return (list(level=max(lvl, nxt$level), path=c(es, nxt$path)))
}
}
This will find a path between two nodes that satisfies the property of having a maximal level when presented with a branch. We call that with this data set with
rr <- findPaths(gg, "100","950")$path
This will find the final path. Since each row in the original df
data.frame is represented by an edge, we just need to extract the edges from the path that correspond to the final path. This actually gives us a path that looks like
where the red path is the chosen one. I can then subset df
with
df[df$id %in% na.omit(E(gg)[rr]$id), ]
# id start end level
# 2 a,b,c 100 350 5
# 4 d 400 450 3
# 5 h 800 850 6
# 6 e 500 550 4
# 7 i 900 950 2
He's another way to look at the start/stop positions. I create a matix where columns correspond to ranges in the rows of the data.frame and the rows of the matrix correspond to positions. Each value in the matrix is true if a range overlaps a position. Here I use the between.R helper function
#find unique positions and create overlap matrix
un<-sort(unique(unlist(df[,2:3])))
cc<-sapply(1:nrow(df), function(i) between(un, df$start[i], df$end[i]))
#partition into non-overlapping sections
groups<-cumsum(c(F,rowSums(cc[-1,]& cc[-nrow(cc),])==0))
#find the IDs to keep from each section
keeps<-lapply(split.data.frame(cc, groups), function(m) {
lengths <- colSums(m)
mx <- which.max(lengths)
gx <- setdiff(which(lengths>0), mx)
if(length(gx)>0) {
if(df$level[mx] > max(df$level[gx])) {
mx
} else {
gx
}
} else {
mx
}
})
This will give a list of the IDs to keep from each group, and we can get the final data.set with
df[unlist(keeps),]
I have one last method. This one might be the most scalable. We basically melt the positions and keep track of opening and closing events to identify the groups. Then we split and see if the longest in each group has the max level or not. Ultimately we return the IDs. This method uses all standard base functions.
#create open/close listing
dd<-rbind(
cbind(df[,c(1,4)],pos=df[,2], evt=1),
cbind(df[,c(1,4)],pos=df[,3], evt=-1)
)
#annotate with useful info
dd<-dd[order(dd$pos, -dd$evt),]
dd$open <- cumsum(dd$evt)
dd$group <- cumsum(c(0,head(dd$open,-1)==0))
dd$width <- ave(dd$pos, dd$id, FUN=function(x) diff(range(x)))
#slim down
dd <- subset(dd, evt==1,select=c("id","level","width","group"))
#process each group
ids<-unlist(lapply(split(dd, dd$group), function(x) {
if(nrow(x)==1) return(x$id)
mw<-which.max(x$width)
ml<-which.max(x$level)
if(mw==ml) {
return(x$id[mw])
} else {
return(x$id[-mw])
}
}))
and finally subset
df[df$id %in% ids, ]
by now I think you know what this returns
So if your real data has the same type of IDs as the sample data, obviously method 1 is a better, more direct choice. I'm still hoping there is a way to simplify method 2 that i'm just missing. I've not done any testing on efficiency or performance of these methods. I'm guessing method 4 might be be the most efficient since it should scale linearly.
Upvotes: 5
Reputation: 9687
I'll take a procedural approach; basically, sort descending by level, and for each record, remove later records that have a matching id.
df <- data.frame(id=c("a","a,b,c","d,e","d","h","e","i","b","c"), start=c(100,100,400,400,800,500,900,200,300), end=c(150,350,550,450,850,550,950,250,350),
level = c(1,5,2,3,6,4,2,1,1), stringsAsFactors=FALSE)
#sort
ids <- df[order(df$level, decreasing=TRUE), "id"]
#split
ids <- sapply(df$id, strsplit, ",")
i <- 1
while( i < length(ids)) {
current <- ids[[i]]
j <- i + 1
while(j <= length(ids)) {
if(any(ids[[j]] %in% current))
ids[[j]] <- NULL
else
j <- j + 1
}
i <- i + 1
}
And finally, only keep the ids that are left:
R> ids <- data.frame(id=names(ids), stringsAsFactors=FALSE)
R> merge(ids, df, sort=FALSE)
id start end level
1 h 800 850 6
2 a,b,c 100 350 5
3 e 500 550 4
4 d 400 450 3
5 i 900 950 2
This has ugly while loops because R only has for-each loops, and also note the stringsAsFactors=FALSE
is necessary for splitting the ids. Deleting middle elements
could be bad for performance, but that will depend on the underlying implementation
R uses for lists (linked vs arrays).
Upvotes: 1