Reputation: 93938
Take this simple data frame of linked ids:
test <- data.frame(id1=c(10,10,1,1,24,8),id2=c(1,36,24,45,300,11))
> test
id1 id2
1 10 1
2 10 36
3 1 24
4 1 45
5 24 300
6 8 11
I now want to group together all the ids which link. By 'link', I mean follow through the chain of links so that all ids in one group are labelled together. A kind of branching structure. i.e:
Group 1
10 --> 1, 1 --> (24,45)
24 --> 300
300 --> NULL
45 --> NULL
10 --> 36, 36 --> NULL,
Final group members: 10,1,24,36,45,300
Group 2
8 --> 11
11 --> NULL
Final group members: 8,11
Now I roughly know the logic I would want, but don't know how I would implement it elegantly. I am thinking of a recursive use of match
or %in%
to go down each branch, but am truly stumped this time.
The final result I would be chasing is:
result <- data.frame(group=c(1,1,1,1,1,1,2,2),id=c(10,1,24,36,45,300,8,11))
> result
group id
1 1 10
2 1 1
3 1 24
4 1 36
5 1 45
6 1 300
7 2 8
8 2 11
Upvotes: 7
Views: 2152
Reputation: 93938
Here's an alternative answer that I have discovered myself after the nudging in the right direction by Josh. This answer uses the igraph
package.
For those that are searching and come across this answer, my test
dataset is referred to as an "edge list" or "adjacency list" in graph theory (http://en.wikipedia.org/wiki/Graph_theory)
library(igraph)
test <- data.frame(id1=c(10,10,1,1,24,8 ),id2=c(1,36,24,45,300,11))
gr.test <- graph_from_data_frame(test)
links <- data.frame(id=unique(unlist(test)),group=components(gr.test)$membership)
links[order(links$group),]
# id group
#1 10 1
#2 1 1
#3 24 1
#5 36 1
#6 45 1
#7 300 1
#4 8 2
#8 11 2
Upvotes: 8
Reputation: 2907
You said recursive... and I thought I'd be super terse while I'm at it.
Test data
mytest <- data.frame(id1=c(10,10,3,1,1,24,8,11,32,11,45),id2=c(1,36,50,24,45,300,11,8,32,12,49))
test <- data.frame(id1=c(10,10,1,1,24,8),id2=c(1,36,24,45,300,11))
Recursive function to get the groupings
aveminrec <- function(v1,v2){
v2 <- ave(v1,by = v2,FUN = min)
if(identical(v1,v2)){
as.numeric(as.factor(v2))
}else{
aveminrec(v2,v1)
}
}
Prep data and simplify after
groupvalues <- function(valuepairs){
val <- unlist(valuepairs)
grp <- aveminrec(val,1:nrow(valuepairs))
unique(data.frame(grp,val)[order(grp,val), ])
}
Get results
groupvalues(test)
groupvalues(mytest)
aveminrec() is probably along the lines of what you were thinking, though I bet there's a way to be more direct about going down each branch instead of repeating ave() which is essentially split() and lapply(). Maybe recursively split and lapply? As it is, it's like repeated partial branching, or alternately simplifying 2 vectors slightly without group information loss.
Maybe parts of this would be used on a real problem, but groupvalues() is way too dense to read without some comments at least. I also haven't checked how performance compares to a for loop with ave and flipping the groups that way.
Upvotes: 0
Reputation: 2907
Without using packages:
# 2 sets of test data
mytest <- data.frame(id1=c(10,10,3,1,1,24,8,11,32,11,45),id2=c(1,36,50,24,45,300,11,8,32,12,49))
test <- data.frame(id1=c(10,10,1,1,24,8),id2=c(1,36,24,45,300,11))
grouppairs <- function(df){
# from wide to long format; assumes df is 2 columns of related id's
test <- data.frame(group = 1:nrow(df),val = unlist(df))
# keep moving to next pair until all same values have same group
i <- 0
while(any(duplicated(unique(test)$val))){
i <- i+1
# get group of matching values
matches <- test[test$val == test$val[i],'group']
# change all groups with matching values to same group
test[test$group %in% matches,'group'] <- test$group[i]
}
# renumber starting from 1 and show only unique values in group order
test$group <- match(test$group, sort(unique(test$group)))
unique(test)[order(unique(test)$group), ]
}
# test
grouppairs(test)
grouppairs(mytest)
Upvotes: 2
Reputation: 162441
The Bioconductor package RBGL (an R interface to the BOOST graph library) contains
a function, connectedComp()
, which identifies the connected components in a graph --
just what you are wanting.
(To use the function, you will first need to install the graph and RBGL packages, available here and here.)
library(RBGL)
test <- data.frame(id1=c(10,10,1,1,24,8),id2=c(1,36,24,45,300,11))
## Convert your 'from-to' data to a 'node and edge-list' representation
## used by the 'graph' & 'RBGL' packages
g <- ftM2graphNEL(as.matrix(test))
## Extract the connected components
cc <- connectedComp(g)
## Massage results into the format you're after
ld <- lapply(seq_along(cc),
function(i) data.frame(group = names(cc)[i], id = cc[[i]]))
do.call(rbind, ld)
# group id
# 1 1 10
# 2 1 1
# 3 1 24
# 4 1 36
# 5 1 45
# 6 1 300
# 7 2 8
# 8 2 11
Upvotes: 7