Hlib

Reputation: 3064

GraphX does not work with relatively big graphs

I cannot process a graph with 230M edges. I cloned apache/spark, built it, and then tried it on a cluster.

I use Spark Standalone Cluster:

- 5 machines (each has 12 cores / 32 GB RAM)
- spark.executor.memory == 25g
- spark.driver.memory == 3g
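
For context, here is roughly how these settings map onto a SparkConf (the application name is just a placeholder). Note that spark.driver.memory generally has to be set before the driver JVM starts, e.g. via spark-submit's --driver-memory flag or spark-defaults.conf, so setting it programmatically is shown only for completeness:

import org.apache.spark.{SparkConf, SparkContext}

// Sketch: the memory settings listed above expressed as Spark configuration.
val conf = new SparkConf()
  .setAppName("graphx-partitioning")      // placeholder application name
  .set("spark.executor.memory", "25g")    // per-executor heap
  .set("spark.driver.memory", "3g")       // effective only if set before the driver starts
val sc = new SparkContext(conf)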

The graph has 231,359,027 edges and its file is 4,524,716,369 bytes. The graph is stored in plain text format:

sourceVertexId destinationVertexId
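
For reference, GraphLoader.edgeListFile expects exactly this layout: each non-comment line holds a source and destination vertex id separated by whitespace (lines starting with # are skipped), and it returns a Graph[Int, Int] with every edge and vertex attribute set to 1. A minimal load (with a placeholder path) looks roughly like:

import org.apache.spark.graphx.GraphLoader

// Sketch: loading an edge list in the format above; the path is a placeholder.
val g = GraphLoader.edgeListFile(sc, "hdfs:///path/to/edges.txt")
// g: Graph[Int, Int] -- edge and vertex attributes default to 1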

My code:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Graph, GraphLoader, PartitionStrategy}
import org.apache.spark.storage.StorageLevel

object Canonical {
  def main(args: Array[String]) {
    val numberOfArguments = 3
    require(args.length == numberOfArguments,
      s"""Wrong number of arguments. Should be $numberOfArguments.
         |Usage: <path_to_graph> <partitioner_name> <minEdgePartitions>""".stripMargin)
    var graph: Graph[Int, Int] = null
    val nameOfGraph = args(0).substring(args(0).lastIndexOf("/") + 1)
    val partitionerName = args(1)
    val minEdgePartitions = args(2).toInt
    val sc = new SparkContext(new SparkConf()
      .setSparkHome(System.getenv("SPARK_HOME"))
      .setAppName(s" partitioning | $nameOfGraph | $partitionerName | $minEdgePartitions parts ")
      .setJars(SparkContext.jarOfClass(this.getClass).toList))
    // Load the edge list, spilling edges and vertices to disk when they do not fit in memory.
    graph = GraphLoader.edgeListFile(sc, args(0), canonicalOrientation = false,
      edgeStorageLevel = StorageLevel.MEMORY_AND_DISK,
      vertexStorageLevel = StorageLevel.MEMORY_AND_DISK,
      minEdgePartitions = minEdgePartitions)
    // Repartition the edges with the chosen strategy.
    graph = graph.partitionBy(PartitionStrategy.fromString(partitionerName))
    println(graph.edges.collect.length)
    println(graph.vertices.collect.length)
  }
}

After running it I encountered a number of java.lang.OutOfMemoryError: Java heap space errors, and of course I did not get a result. Is the problem in my code or in the cluster configuration? It works fine for relatively small graphs, but for this graph it never worked. (And I do not think that 230M edges is too much data.)

Thank you for any advice!


RESOLVED

I did not allocate enough memory for the driver program. I changed the cluster configuration to:

- 4 workers (each has 12 cores / 32 GB RAM)
- 1 master running the driver program (12 cores / 32 GB RAM)
- spark.executor.memory == 25g
- spark.driver.memory == 25g

It was also not a good idea to collect all vertices and edges just to count them; it is much easier to use graph.vertices.count and graph.edges.count.
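
A minimal sketch of the corrected counting step, assuming the rest of the program stays as above; count runs as a distributed job instead of pulling every element into the driver:

// Sketch: count on the executors instead of collecting everything to the driver.
val numEdges    = graph.edges.count()
val numVertices = graph.vertices.count()
println(s"edges: $numEdges, vertices: $numVertices")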

Upvotes: 3

Views: 1501

Answers (1)

Daniel Darabos

Reputation: 27455

What I suggest is a binary search to find the maximal amount of data the cluster can handle. Take 50% of the graph and see if that works. If it does, try 75%, and so on.
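
One rough way to do that (I have not tried it on GraphX specifically; the path and fraction below are placeholders) is to sample the edge list before building the graph:

import org.apache.spark.graphx.{Edge, Graph}

// Sketch: build a graph from ~50% of the edge list to probe the cluster's limit.
val sampledEdges = sc.textFile("hdfs:///path/to/edges.txt")        // placeholder path
  .filter(line => line.nonEmpty && !line.startsWith("#"))
  .sample(withReplacement = false, fraction = 0.5)                 // keep ~50% of the lines
  .map { line =>
    val parts = line.split("\\s+")
    Edge(parts(0).toLong, parts(1).toLong, 1)
  }
val sampledGraph = Graph.fromEdges(sampledEdges, defaultValue = 1)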

My rule of thumb is you need 20–30× the memory for a given size of input. For 4.5 GB this suggests the limit would be around 100 GB. You have exactly that amount. I have no experience with GraphX: it probably adds another multiplier to the memory use. In my opinion you simply don't have enough memory.

Upvotes: 3
