Bob

Reputation: 879

How do I get the size of the largest connected component of a graph in Spark?

I'm building a graph from an RDD of tuples of source and destination nodes, like this:

    Graph.fromEdgeTuples(rawEdges = edgeList, 1)

  1. First off, I did not quite understand what the second parameter is. From the documentation,

    defaultValue the vertex attributes with which to create vertices referenced by the edges

    I still don't get it.

  2. Second, I cannot find anything to compute the size of the biggest component. There is no foreach implemented, nor map or reduceByKey, or anything else after invoking the connectedComponents method.

Upvotes: 0

Views: 863

Answers (1)

user7381000

Reputation: 31

  1. defaultValue is the attribute assigned to every vertex created from the edge list (the edges themselves get the attribute 1 unless uniqueEdges is set):

    val graph = Graph.fromEdgeTuples(sc.parallelize(Seq(
      (1, 2), (2, 3), (4, 5))), 1)
    
    graph.vertices.map(_._2).distinct.collect
    // Array[Int] = Array(1)
    
  2. Extract the component ids, do a word count on them, and take the largest count:

    val ids = graph.connectedComponents.vertices.map((v: (Long, Long)) => v._2)
    val sizes = ids.map((_, 1L)).reduceByKey(_ + _)
    sizes.values.max()
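
For what it's worth, the steps above can be put together as a small standalone program. This is only a sketch: the object name LargestComponent, the helper largestComponent, the default vertex attribute 0 and the local[*] master are my own choices for illustration, not anything GraphX prescribes. It relies on connectedComponents labelling every vertex with the lowest vertex id in its component.

```scala
import scala.reflect.ClassTag

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Graph, VertexId}

object LargestComponent {

  // Hypothetical helper: returns (componentId, vertexCount) of the largest
  // connected component. The component id is the lowest vertex id in it.
  def largestComponent[VD: ClassTag, ED: ClassTag](graph: Graph[VD, ED]): (VertexId, Long) =
    graph.connectedComponents().vertices
      .map { case (_, componentId) => (componentId, 1L) } // one count per vertex
      .reduceByKey(_ + _)                                 // size of each component
      .reduce((a, b) => if (a._2 >= b._2) a else b)       // keep the biggest

  def main(args: Array[String]): Unit = {
    // Assumption: a local master, just so the sketch runs on one machine.
    val conf = new SparkConf().setAppName("largest-component").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val edges = sc.parallelize(Seq((1L, 2L), (2L, 3L), (4L, 5L)))
      // The second argument (0 here) becomes the attribute of every vertex.
      val graph = Graph.fromEdgeTuples(edges, 0)
      val (id, size) = largestComponent(graph)
      println(s"component $id has $size vertices") // component 1 has 3 vertices
    } finally {
      sc.stop()
    }
  }
}
```

If two components tie for the largest size, which one the reduce returns is not guaranteed, so compare only the size in that case.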
    

Upvotes: 3
