Reputation: 394
When I run my Spark Python code below:
import pyspark
conf = (pyspark.SparkConf()
        .setMaster("local")
        .setAppName("My app")
        .set("spark.executor.memory", "512m"))
sc = pyspark.SparkContext(conf=conf)  # start the SparkContext with this conf
data = sc.textFile('/Users/tsangbosco/Downloads/transactions')
data = data.flatMap(lambda x: x.split()).take(all)
The file is about 20 GB and my computer has 8 GB of RAM. When I run the program in standalone mode, it raises an OutOfMemoryError:
Exception in thread "Local computation of job 12" java.lang.OutOfMemoryError: Java heap space
at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:131)
at org.apache.spark.api.python.PythonRDD$$anon$1.next(PythonRDD.scala:119)
at org.apache.spark.api.python.PythonRDD$$anon$1.next(PythonRDD.scala:112)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at org.apache.spark.api.python.PythonRDD$$anon$1.foreach(PythonRDD.scala:112)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at org.apache.spark.api.python.PythonRDD$$anon$1.to(PythonRDD.scala:112)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at org.apache.spark.api.python.PythonRDD$$anon$1.toBuffer(PythonRDD.scala:112)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at org.apache.spark.api.python.PythonRDD$$anon$1.toArray(PythonRDD.scala:112)
at org.apache.spark.api.java.JavaRDDLike$$anonfun$1.apply(JavaRDDLike.scala:259)
at org.apache.spark.api.java.JavaRDDLike$$anonfun$1.apply(JavaRDDLike.scala:259)
at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:884)
at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:884)
at org.apache.spark.scheduler.DAGScheduler.runLocallyWithinThread(DAGScheduler.scala:681)
at org.apache.spark.scheduler.DAGScheduler$$anon$1.run(DAGScheduler.scala:666)
Is Spark unable to deal with files larger than my RAM? Could you tell me how to fix this?
Upvotes: 3
Views: 1865
Reputation: 20816
Spark can handle this case, but you are using take, which forces Spark to fetch all of the data into an array on the driver (in memory). In such a case you should write the results out to files instead, for example with saveAsTextFile.
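As a rough sketch (the output path below is just a placeholder; Spark creates it as a directory of part-* files):
import pyspark

conf = (pyspark.SparkConf()
        .setMaster("local")
        .setAppName("My app")
        .set("spark.executor.memory", "512m"))
sc = pyspark.SparkContext(conf=conf)

data = sc.textFile('/Users/tsangbosco/Downloads/transactions')
words = data.flatMap(lambda x: x.split())

# Write the result to disk instead of pulling it all into driver memory.
words.saveAsTextFile('/Users/tsangbosco/Downloads/transactions_words')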
If you are interested in looking at some of the data, you can use sample or takeSample.
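Continuing from the snippet above (the fraction and sample size here are arbitrary examples):
# Look at a small random subset instead of the whole RDD.
preview = words.sample(withReplacement=False, fraction=0.0001).collect()

# Or take a fixed number of random elements directly on the driver:
few = words.takeSample(withReplacement=False, num=10)
print(few)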
Upvotes: 5