Reputation: 3
I am trying to run a lambda on each connected component of a graph in Spark GraphX. I get the components with the connectedComponents() method, but the only approach I have found is to collect all the distinct component labels assigned to the vertices, loop over them with foreach, and extract each component with subgraph(). This is a sequential process, so it does not scale when the graph has a lot of small components. Is there a way to say something like connectedComponentsGraph.foreachComponent(lambda)?
Upvotes: 0
Views: 441
Reputation: 35229
I'd recommend using graphframes:
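import org.apache.spark.graphx.Graph
import org.graphframes._
// The attribute types are placeholders; use your graph's own vertex and edge types.
val graph: Graph[String, String] = ???
val gdf = GraphFrame.fromGraphX(graph)
val components = gdf.connectedComponents.setAlgorithm("graphx").run()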
and follow up with basic SQL:
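import spark.implicits._  // enables $"col"; assumes a SparkSession named spark
// Count edges per component, attributing each edge to its source vertex.
components
  .join(gdf.vertices, Seq("id"))
  .join(gdf.edges.select($"src" as "id"), Seq("id"))
  .groupBy("component")
  .count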
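If you would rather stay in plain GraphX and run an arbitrary function per component, here is a minimal sketch of one way to do it. Note that foreachComponent is a hypothetical helper written for this answer, not part of GraphX or GraphFrames. It assumes each individual component is small enough to fit in a single executor's memory, which matches the "lots of small components" case in the question.

```scala
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import scala.reflect.ClassTag

// Hypothetical helper (not a GraphX API): applies f once per connected
// component, with the components processed in parallel on the executors.
def foreachComponent[VD: ClassTag, ED: ClassTag](graph: Graph[VD, ED])(
    f: (VertexId, Iterable[(VertexId, VD)]) => Unit): Unit = {
  // connectedComponents() labels every vertex with the smallest
  // vertex id found in its component.
  val labels = graph.connectedComponents().vertices

  graph.vertices
    .join(labels)                                        // (id, (attr, componentId))
    .map { case (id, (attr, cid)) => (cid, (id, attr)) } // re-key by component
    .groupByKey()                                        // one group per component
    .foreach { case (cid, members) => f(cid, members) }  // runs on the executors
}
```

A call would look like foreachComponent(graph) { (cid, members) => ... }. The function runs on the executors, so it has to be serializable and any side effects (println, for example) happen there rather than on the driver. If the lambda also needs the edges, they could be keyed by the component of their source vertex in the same way and cogrouped with the vertices.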
Upvotes: 1