Reputation: 305
I'm trying to find an algorithm for computing disjoint sets (connected components / union-find) over a large amount of data with Apache Spark. The problem is the volume of data: even the raw representation of the graph's vertices doesn't fit into RAM on a single machine, and neither do the edges.
The source data is a text file of graph edges on HDFS, one edge per line: "id1 \t id2".
The ids are string values, not ints.
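For reference, one way to load such a file and map the string ids to dense Long ids (useful because some graph libraries expect numeric vertex ids). This is a sketch with an illustrative HDFS path, assuming a `SparkContext` named `sc` already exists:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

val sc: SparkContext = ???  // assumed to exist

// Parse "id1 \t id2" lines into string edge pairs.
val rawEdges: RDD[(String, String)] =
  sc.textFile("hdfs:///path/to/edges")   // illustrative path
    .map(_.split("\t"))
    .map(a => (a(0), a(1)))

// Assign each distinct string id a unique Long.
val idMap: RDD[(String, Long)] =
  rawEdges.flatMap { case (a, b) => Seq(a, b) }.distinct().zipWithUniqueId()

// Translate the edges to Long pairs via two joins.
val edges: RDD[(Long, Long)] =
  rawEdges
    .join(idMap)                               // (a, (b, aId))
    .map { case (_, (b, aId)) => (b, aId) }
    .join(idMap)                               // (b, (aId, bId))
    .map { case (_, (aId, bId)) => (aId, bId) }
```

Keeping `idMap` around also lets you translate component results back to the original string ids afterwards.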
Naive solution that I found is:

1. Take the RDD of edge pairs: [id1:id2] [id3:id4] [id1:id3]
2. Group edges by key: [id1:[id2;id3]] [id3:[id4]]
3. For each group, pair every member (and the key itself) with the group key (flatMap): [id1:id1] [id2:id1] [id3:id1] [id3:id3] [id4:id3]
4. Reverse each pair from stage 3: [id2:id1] -> [id1:id2]
5. leftOuterJoin of the RDDs from stages 3 and 4, and repeat until the labels stop changing.

But this results in the transfer of large amounts of data between nodes (shuffling).
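The convergence logic of the steps above can be sketched with plain Scala collections standing in for RDDs (names are illustrative; a real Spark version would express the passes as the joins described above). Labels only ever decrease and each pass propagates the minimum label along every edge, so the loop reaches a fixpoint where every vertex carries the smallest id in its component:

```scala
// In-memory sketch of iterative min-label propagation for connected components.
def connectedComponents(edges: Seq[(String, String)]): Map[String, String] = {
  // Start with every vertex labelled by itself.
  var labels: Map[String, String] =
    edges.flatMap { case (a, b) => Seq(a, b) }.distinct.map(v => v -> v).toMap
  var changed = true
  while (changed) {
    changed = false
    for ((a, b) <- edges) {
      // Propagate the smaller label across the edge.
      val min = Ordering[String].min(labels(a), labels(b))
      if (labels(a) != min || labels(b) != min) {
        labels = labels.updated(a, min).updated(b, min)
        changed = true
      }
    }
  }
  labels
}
```

On the example edges [id1:id2] [id3:id4] [id1:id3] every vertex ends up labelled "id1", since all four ids belong to one component.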
Any advice?
Upvotes: 9
Views: 2337
Reputation: 121
In addition to @Marsellus Wallace's answer, below is the full code to get disjoint sets from an RDD of edges using GraphX.
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

val edges: RDD[(Long, Long)] = ???
val g = Graph.fromEdgeTuples(edges, -1L)

val disjointSets: RDD[Iterable[Long]] = g.connectedComponents()
  // Get tuples of (vertexId, componentId), where componentId is the
  // smallest vertexId in the component
  .vertices
  // Group by componentId so each group aggregates one disjoint set
  .groupBy(_._2)
  .values
  // Keep only the member vertex ids
  .map(_.map(_._1))
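If you only need each vertex's component id (not the full member lists), `g.connectedComponents().vertices` alone is enough and skips the extra `groupBy` shuffle. And since the question's ids are strings, the Long results can be joined back to them. A sketch, assuming a hypothetical `idMap: RDD[(String, Long)]` built when the string ids were converted to Longs:

```scala
import org.apache.spark.rdd.RDD

// Hypothetical mapping built while converting string ids to Longs.
val idMap: RDD[(String, Long)] = ???

// (vertexId, componentId) pairs from GraphX.
val cc: RDD[(Long, Long)] = ???  // g.connectedComponents().vertices

// Recover (stringId, componentId) by joining on the Long id.
val byString: RDD[(String, Long)] =
  idMap.map(_.swap)   // (longId, stringId)
       .join(cc)      // (longId, (stringId, componentId))
       .values        // (stringId, componentId)
```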
Upvotes: 0
Reputation: 18611
If you are working with graphs I would suggest that you take a look at either of these two libraries: GraphX or GraphFrames.
They both provide the connected components algorithm out of the box.
GraphX:
val graph: Graph = ...
val cc = graph.connectedComponents().vertices
GraphFrames:
val graph: GraphFrame = ...
val cc = graph.connectedComponents.run()
cc.select("id", "component").orderBy("component").show()
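A more end-to-end GraphFrames sketch for the question's setup (illustrative paths; assumes the `graphframes` package is on the classpath). GraphFrames accepts string vertex ids directly, which sidesteps the string-to-Long mapping GraphX needs, and the default connected components algorithm requires a checkpoint directory to be set:

```scala
import org.apache.spark.sql.SparkSession
import org.graphframes.GraphFrame

val spark = SparkSession.builder.appName("cc").getOrCreate()
// Required by the default GraphFrames connected components implementation.
spark.sparkContext.setCheckpointDir("/tmp/checkpoints")  // illustrative path

// The edges DataFrame needs "src" and "dst" columns; string ids are fine.
val edges = spark.read
  .option("delimiter", "\t")
  .csv("hdfs:///path/to/edges")   // illustrative path
  .toDF("src", "dst")

// Derive the vertices from the edges and run connected components.
val g = GraphFrame.fromEdges(edges)
val cc = g.connectedComponents.run()
cc.select("id", "component").orderBy("component").show()
```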
Upvotes: 2