Saurabh

Reputation: 163

Unable to connect to HDFS from a worker/data node on a Kerberized cluster using pyarrow's hdfs API

Here's what I'm trying:

import pyarrow as pa

conf = {"hadoop.security.authentication": "kerberos"}
fs = pa.hdfs.connect(kerb_ticket="/tmp/krb5cc_44444", extra_conf=conf)

However, when I submit this job to the cluster using Dask-YARN, I get the following error:

  File "test/run.py", line 3
    fs = pa.hdfs.connect(kerb_ticket="/tmp/krb5cc_44444", extra_conf=conf)
  File "/opt/hadoop/data/10/hadoop/yarn/local/usercache/hdfsf6/appcache/application_1560931326013_183242/container_e47_1560931326013_183242_01_000003/environment/lib/python3.7/site-packages/pyarrow/hdfs.py", line 211, in connect
  File "/opt/hadoop/data/10/hadoop/yarn/local/usercache/hdfsf6/appcache/application_1560931326013_183242/container_e47_1560931326013_183242_01_000003/environment/lib/python3.7/site-packages/pyarrow/hdfs.py", line 38, in __init__
  File "pyarrow/io-hdfs.pxi", line 105, in pyarrow.lib.HadoopFileSystem._connect
  File "pyarrow/error.pxi", line 83, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: HDFS connection failed

I also tried setting host (to a name node) and port (=8020), but I ran into the same error. Since the error is not descriptive, I'm not sure which setting needs to be altered. Any clues?
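For reference, the explicit host/port attempt looked roughly like this (the namenode hostname here is just a placeholder):

import pyarrow as pa

conf = {"hadoop.security.authentication": "kerberos"}
# host/port are placeholders for the actual namenode address
fs = pa.hdfs.connect(host="namenode.example.com", port=8020,
                     kerb_ticket="/tmp/krb5cc_44444", extra_conf=conf)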

Upvotes: 0

Views: 1640

Answers (1)

jiminy_crist

Reputation: 2445

Usually the configuration and Kerberos ticket are loaded automatically, and you should just be able to connect using

fs = pa.hdfs.connect()

alone. This does require that you have already called kinit (on worker nodes the credentials, but not the ticket, are transferred to the worker environment automatically, so there is nothing extra to do there). I suggest trying with no parameters locally first, then on a worker node.
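For example, a minimal sanity check (assuming kinit has already been run locally, or the forwarded credentials are in place on a worker) might look like:

import pyarrow as pa

# Hadoop configuration and the Kerberos ticket cache are usually
# picked up automatically from the environment, so no arguments are needed.
fs = pa.hdfs.connect()
print(fs.ls("/"))  # list the HDFS root to confirm the connection works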

Upvotes: 1
