Reputation: 1428
I have a function my_function which has a for loop iteration. There is a part in the for loop which requires to call a dataframe collect(). It works for the first few loops, but it always crashed at the fifth iteration. Do you guys have any idea why this happens?
File "my_code.py", line 189, in my_function
my_df_collect = my_df.collect()
File "/lib/spark/python/pyspark/sql/dataframe.py", line 280, in collect
port = self._jdf.collectToPython()
File "/lib/spark/python/pyspark/traceback_utils.py", line 78, in __exit__
self._context._jsc.setCallSite(None)
File "/lib/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 811, in __call__
answer = self.gateway_client.send_command(command)
File "/lib/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 624, in send_command
connection = self._get_connection()
File "/lib/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 579, in _get_connection
connection = self._create_connection()
File "/lib/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 585, in _create_connection
connection.start()
File "/lib/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 697, in start
raise Py4JNetworkError(msg, e)
Py4JNetworkError: An error occurred while trying to connect to the Java server
Another error message
Exception happened during processing of request from ('127.0.0.1', 55584)
Traceback (most recent call last):
File "/anaconda/lib/python2.7/SocketServer.py", line 295, in _handle_request_noblock
self.process_request(request, client_address)
File "/anaconda/lib/python2.7/SocketServer.py", line 321, in process_request
self.finish_request(request, client_address)
File "/anaconda/lib/python2.7/SocketServer.py", line 334, in finish_request
self.RequestHandlerClass(request, client_address, self)
File "/anaconda/lib/python2.7/SocketServer.py", line 655, in __init__
self.handle()
File "/lib/spark/python/pyspark/accumulators.py", line 235, in handle
num_updates = read_int(self.rfile)
File "/lib/spark/python/pyspark/serializers.py", line 545, in read_int
raise EOFError
EOFError
Upvotes: 0
Views: 5333
Reputation: 838
Perhaps the JVM is overflowing. Try adding memory to Driver, or running df.take(10)
instead of df.collect()
to test if the problem is the amount of data that you are returning.
Upvotes: 2