bogeyman

Reputation: 179

R: creating clusters based on overlapping/intersecting rows

I have the following data frame in R, with overlapping values across the two columns a_sno and b_sno:

a_sno <- c(4,5,5,6,6,7,9,9,10,10,10,11,13,13,13,14,14,15,21,21,21,22,23,23,24,25,183,184,185,185,200)
b_sno <- c(5,4,6,5,7,6,10,13,9,13,14,15,9,10,14,10,13,11,22,23,24,21,21,25,21,23,185,185,183,184,200)
df <- data.frame(a_sno, b_sno)

If you look closely at the data, you can see that 4, 5, 6 and 7 intersect/overlap, and I need to put them into a group called 1. Likewise, 9, 10, 13 and 14 go into group 2, 11 and 15 into group 3, and so on. 200 does not intersect with any other row but still needs to be assigned its own group.

The resulting output should look like this:

---------
group|sno
---------
1    | 4
1    | 5
1    | 6
1    | 7
2    | 9
2    | 10
2    | 13
2    | 14
3    | 11
3    | 15
4    | 21
4    | 22
4    | 23
4    | 24
4    | 25
5    | 183
5    | 184
5    | 185
6    | 200

Any help to get this done is much appreciated. Thanks

Upvotes: 3

Views: 518

Answers (1)

NicE

Reputation: 21443

This is probably not the most efficient solution, but you could use graphs to do this, treating each row as an edge and each connected component as a group:

#sort each row so that (a, b) and (b, a) become the same pair, then remove duplicates
df <- unique(t(apply(df, 1, sort)))

#load the library
library(igraph)

#make a graph with your data, one edge per row
#(graph_from_data_frame() and decompose() are the current names of
#graph.data.frame() and decompose.graph())
graph <- graph_from_data_frame(as.data.frame(df), directed = FALSE)

#decompose it into connected components
components <- decompose(graph)

#get the vertices of the subgraphs
result <- lapply(seq_along(components), function(i) {
  vertex <- as.numeric(V(components[[i]])$name)
  cbind(rep(i, length(vertex)), vertex)
})

#make the final data frame
output <- as.data.frame(do.call(rbind, result))
colnames(output) <- c("group", "sno")
output
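
A possibly simpler variant of the same graph idea, offered as a sketch rather than a tested replacement: instead of splitting the graph into subgraphs, igraph's components() returns a membership vector with one component id per vertex. The variable names g and memb are my own, and the group numbers should be treated as arbitrary labels rather than a guaranteed ordering. Sorting and de-duplicating the rows first should not be needed here, since the graph is built as undirected.

```r
library(igraph)

# Data from the question
a_sno <- c(4,5,5,6,6,7,9,9,10,10,10,11,13,13,13,14,14,15,21,21,21,22,23,23,24,25,183,184,185,185,200)
b_sno <- c(5,4,6,5,7,6,10,13,9,13,14,15,9,10,14,10,13,11,22,23,24,21,21,25,21,23,185,185,183,184,200)
df <- data.frame(a_sno, b_sno)

# Undirected graph with one edge per row; repeated pairs and the
# self-loop 200-200 do not change which vertices are connected
g <- graph_from_data_frame(df, directed = FALSE)

# One component id per vertex, in the same order as V(g)
memb <- components(g)$membership

# Assemble the result and sort it by group, then by sno
output <- data.frame(group = as.integer(memb),
                     sno = as.numeric(V(g)$name))
output <- output[order(output$group, output$sno), ]
rownames(output) <- NULL
output
```

This avoids building one subgraph per group, which may matter on larger inputs, although I have not benchmarked it against the decompose() version.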

Upvotes: 3
