Reputation: 1748
I am using the spark-submit script to upload my python script to the Spark Cluster, but am getting the following error:
Traceback (most recent call last):
File "/gpfs/fs01/user/sf6d-7c3a9c08343577-05540e1c503a/data/workdir/spark-driver-cc30d6d8-1518-45b1-a4a7-8421deaa3482/2_do_extract.py", line 139, in do_extraction
r = resRDD.collect()
File "/usr/local/src/spark160master/spark-1.6.0-bin-2.6.0/python/lib/pyspark.zip/pyspark/rdd.py", line 771, in collect
port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
File "/usr/local/src/spark160master/spark-1.6.0-bin-2.6.0/python/lib/pyspark.zip/pyspark/traceback_utils.py", line 78, in __exit__
self._context._jsc.setCallSite(None)
File "/usr/local/src/spark160master/spark-1.6.0-bin-2.6.0/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 811, in __call__
answer = self.gateway_client.send_command(command)
File "/usr/local/src/spark160master/spark-1.6.0-bin-2.6.0/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 624, in send_command
connection = self._get_connection()
File "/usr/local/src/spark160master/spark-1.6.0-bin-2.6.0/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 579, in _get_connection
connection = self._create_connection()
File "/usr/local/src/spark160master/spark-1.6.0-bin-2.6.0/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 585, in _create_connection
connection.start()
File "/usr/local/src/spark160master/spark-1.6.0-bin-2.6.0/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 697, in start
raise Py4JNetworkError(msg, e)
Py4JNetworkError: An error occurred while trying to connect to the Java server
I am certain that this error occurs because the driver runs out of memory when executing the script: for a smaller dataset the script executes successfully, and for a larger dataset I get this error.
Reading the spark-submit documentation, I have tried every configuration I could find to increase the driver memory, executor memory, etc., like the following:
/bin/sh spark-submit.sh --vcap vcap.json my_python_script.py --master https://169.54.219.20 --deploy-mode cluster --driver-memory 5g --executor-memory 5g --driver-maxResultSize 5g --worker-memory 5g
But it seems to be impossible to change the memory.
Please explain how I can set these variables, because even moderate memory usage causes the job to fail.
Upvotes: 0
Views: 246
Reputation: 430
The resources your Spark jobs are allowed are determined by the plan you choose when you provision a Bluemix Apache Spark service instance. Consequently, you cannot pick arbitrary settings (memory, executors, etc.) on a per-spark-submit-job basis. Instead, you get the maximum your plan dictates.
For your particular error, what we see is that your application breaks in RDD.collect(), which by definition "returns all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data." The Spark documentation also warns: "one can use the collect() method to first bring the RDD to the driver node. ... This can cause the driver to run out of memory, though, because collect() fetches the entire RDD to a single machine." (http://spark.apache.org/docs/latest/programming-guide.html)
To work around the limited heap size on the driver, your application can instead use RDD.toLocalIterator(), which returns an iterator over all the elements of the RDD and consumes only as much memory as the largest partition of the RDD. Details here: http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.toLocalIterator
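As a rough sketch of that swap (resRDD here is just a stand-in for whatever RDD your 2_do_extract.py builds, and the per-element work is a placeholder), something like this keeps only one partition's worth of data on the driver at a time:

from pyspark import SparkContext

sc = SparkContext(appName="extract-example")          # adjust to however you create your context
resRDD = sc.parallelize(range(1000000), numSlices=8)  # stand-in for your real RDD

# Instead of r = resRDD.collect(), which pulls the whole dataset to the driver at once,
# iterate partition by partition; only the largest partition is held in driver memory.
total = 0
for value in resRDD.toLocalIterator():
    total += value   # placeholder for whatever per-element work your extraction does

print(total)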
The other thing to check is whether your program explicitly sets the master to local, which imposes significant resource limitations, e.g. conf = SparkConf().setAppName('myapp').setMaster('local')
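A minimal sketch of the alternative: leave setMaster() out of the code so that spark-submit (and the service plan) decides where and with what resources the job runs. The app name below is just an example:

from pyspark import SparkConf, SparkContext

# No setMaster() call here; the master passed to spark-submit (or the cluster default) is used.
conf = SparkConf().setAppName('myapp')
sc = SparkContext(conf=conf)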
Upvotes: 1