Reputation: 151
I have some simple PySpark code:

import pyarrow as pa
fs = pa.hdfs.connect()
If I run this using spark-submit in "client" mode, it works fine, but in "cluster" mode it throws the error:
Traceback (most recent call last):
  File "t3.py", line 17, in <module>
    fs = pa.hdfs.connect()
  File "/opt/anaconda/3.6/lib/python3.6/site-packages/pyarrow/hdfs.py", line 181, in connect
    kerb_ticket=kerb_ticket, driver=driver)
  File "/opt/anaconda/3.6/lib/python3.6/site-packages/pyarrow/hdfs.py", line 37, in __init__
    self._connect(host, port, user, kerb_ticket, driver)
  File "io-hdfs.pxi", line 99, in pyarrow.lib.HadoopFileSystem._connect
  File "error.pxi", line 79, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: HDFS connection failed
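For reference, the two invocations look roughly like this (a sketch; I'm showing --master yarn here only to illustrate the deploy-mode difference, the exact master setting may differ):

spark-submit --master yarn --deploy-mode client t3.py     # works fine
spark-submit --master yarn --deploy-mode cluster t3.py    # fails with the error above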
All the necessary Python libraries are installed on every node in my Hadoop cluster. I have verified this by testing the code under pyspark on each node individually, along the lines below.
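The per-node check was roughly this (a sketch of an interactive pyspark session):

$ pyspark
>>> import pyarrow as pa
>>> fs = pa.hdfs.connect()   # succeeds when run directly on each node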
But I cannot make it work through spark-submit in cluster mode.
Any ideas?
shankar
Upvotes: 5
Views: 474