Reputation: 3064
I cannot process a graph with 230M edges. I cloned Apache Spark, built it and then tried it on a cluster.
I use a Spark Standalone cluster (see the configuration sketch after this list):
-5 machines (each has 12 cores/32GB RAM)
-'spark.executor.memory' == 25g
-'spark.driver.memory' == 3g
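For reference, this is roughly how the two memory settings above map onto a SparkConf (the app name is a placeholder; note that spark.driver.memory generally has to be supplied before the driver JVM starts, e.g. via spark-submit's --driver-memory flag or spark-defaults.conf, rather than in application code):

import org.apache.spark.SparkConf

// Minimal sketch (placeholder app name) of the memory settings above.
// spark.driver.memory set here is only honoured if the driver JVM has not
// started yet, so it is usually passed to spark-submit instead.
val conf = new SparkConf()
  .setAppName("graph-partitioning")
  .set("spark.executor.memory", "25g")
  .set("spark.driver.memory", "3g")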
The graph has 231,359,027 edges and its file weighs 4,524,716,369 bytes. The graph is represented in a text format:
sourceVertexId destinationVertexId
My code:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Graph, GraphLoader, PartitionStrategy}
import org.apache.spark.storage.StorageLevel

object Canonical {

  def main(args: Array[String]) {
    val numberOfArguments = 3
    require(args.length == numberOfArguments,
      s"""Wrong argument number. Should be $numberOfArguments.
         |Usage: <path_to_graph> <partitioner_name> <minEdgePartitions>""".stripMargin)

    val nameOfGraph = args(0).substring(args(0).lastIndexOf("/") + 1)
    val partitionerName = args(1)
    val minEdgePartitions = args(2).toInt

    val sc = new SparkContext(new SparkConf()
      .setSparkHome(System.getenv("SPARK_HOME"))
      .setAppName(s" partitioning | $nameOfGraph | $partitionerName | $minEdgePartitions parts ")
      .setJars(SparkContext.jarOfClass(this.getClass).toList))

    // Load the edge list, spilling edges and vertices to disk when they do not fit in memory.
    var graph: Graph[Int, Int] = GraphLoader.edgeListFile(sc, args(0), false,
      edgeStorageLevel = StorageLevel.MEMORY_AND_DISK,
      vertexStorageLevel = StorageLevel.MEMORY_AND_DISK,
      minEdgePartitions = minEdgePartitions)
    graph = graph.partitionBy(PartitionStrategy.fromString(partitionerName))

    // Note: collect ships every edge and vertex back to the driver.
    println(graph.edges.collect.length)
    println(graph.vertices.collect.length)
  }
}
After running it I encountered a number of java.lang.OutOfMemoryError: Java heap space
errors and, of course, did not get a result.
Is the problem in my code or in the cluster configuration?
It works fine for relatively small graphs, but for this graph it never worked. (And I do not think that 230M edges is too much data.)
Thank you for any advice!
RESOLVED
I did not allocate enough memory for the driver program. I changed the cluster configuration to:
-4 workers (each has 12 cores/32GB RAM)
-1 master with the driver program (12 cores/32GB RAM)
-'spark.executor.memory' == 25g
-'spark.driver.memory' == 25g
It was also not a good idea to collect all vertices and edges just to count them; it is much simpler to call graph.vertices.count
and graph.edges.count
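In code, the two collect-based prints from the question become distributed counts, so nothing is pulled back to the driver:

// Count on the executors instead of collecting everything to the driver.
println(graph.edges.count())
println(graph.vertices.count())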
Upvotes: 3
Views: 1501
Reputation: 27455
What I suggest is that you do a binary search to find the maximum amount of data the cluster can handle: take 50% of the graph and see if that works; if it does, try 75%, and so on.
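As a rough sketch of how such a test could be run (assuming the whitespace-separated edge-list format from the question; loadSample and its fraction argument are hypothetical names introduced here), one option is to sample the input file and build the graph from the sampled edges:

import org.apache.spark.SparkContext
import org.apache.spark.graphx.Graph

// Rough sketch: build a graph from only a sampled fraction of the edge list,
// to probe how much of the input the cluster can actually handle.
def loadSample(sc: SparkContext, path: String, fraction: Double): Graph[Int, Int] = {
  val sampledEdges = sc.textFile(path)
    .sample(withReplacement = false, fraction)      // keep roughly `fraction` of the lines
    .filter(line => line.nonEmpty && !line.startsWith("#"))
    .map { line =>
      val fields = line.split("\\s+")
      (fields(0).toLong, fields(1).toLong)          // (sourceVertexId, destinationVertexId)
    }
  Graph.fromEdgeTuples(sampledEdges, defaultValue = 1)  // vertex attributes default to 1
}

Calling, say, loadSample(sc, args(0), 0.5) and running the same partitioning and counting on the result would correspond to the 50% step of the search.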
My rule of thumb is that you need 20–30× the input size in memory. For 4.5 GB of input this suggests the limit is around 100 GB, and you have exactly that amount. I have no experience with GraphX: it probably adds another multiplier to the memory use. In my opinion you simply don't have enough memory.
Upvotes: 3