Kristada673

Reputation: 3744

Connection timeout error while reading a file from HDFS using Python

I have created a single node HDFS in a VM (hadoop.master, IP: 192.168.12.52). The file etc/hadoop/core-site.xml has the following configuration for the namenode:

<configuration>
 <property>
  <name>fs.defaultFS</name>
  <value>hdfs://master.hadoop:9000/</value>
 </property>
</configuration>

I want to read a file from the HDFS on my local, physical desktop. For that, this is my code, which I've saved in a file named hdfs_read.py:

from hdfs import InsecureClient
client = InsecureClient('http://192.168.12.52:9000')
with client.read('/opt/hadoop/LICENSE.txt') as reader:
  features = reader.read()
  print(features)

Now when I run it, I get the following timeout error:

$ python3 hdfs_read.py 
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/urllib3/connection.py", line 137, in _new_conn
    (self.host, self.port), self.timeout, **extra_kw)
  File "/usr/lib/python3/dist-packages/urllib3/util/connection.py", line 91, in create_connection
    raise err
  File "/usr/lib/python3/dist-packages/urllib3/util/connection.py", line 81, in create_connection
    sock.connect(sa)
OSError: [Errno 113] No route to host

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 560, in urlopen
    body=body, headers=headers)
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 354, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/usr/lib/python3.6/http/client.py", line 1239, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/lib/python3.6/http/client.py", line 1285, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.6/http/client.py", line 1234, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.6/http/client.py", line 1026, in _send_output
    self.send(msg)
  File "/usr/lib/python3.6/http/client.py", line 964, in send
    self.connect()
  File "/usr/lib/python3/dist-packages/urllib3/connection.py", line 162, in connect
    conn = self._new_conn()
  File "/usr/lib/python3/dist-packages/urllib3/connection.py", line 146, in _new_conn
    self, "Failed to establish a new connection: %s" % e)
requests.packages.urllib3.exceptions.NewConnectionError: <requests.packages.urllib3.connection.HTTPConnection object at 0x7f2d88cef2b0>: Failed to establish a new connection: [Errno 113] No route to host

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/requests/adapters.py", line 376, in send
    timeout=timeout
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 610, in urlopen
    _stacktrace=sys.exc_info()[2])
  File "/usr/lib/python3/dist-packages/urllib3/util/retry.py", line 273, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
requests.packages.urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='192.168.12.52', port=9000): Max retries exceeded with url: /webhdfs/v1/home/edhuser/testdata.txt?user.name=embs&offset=0&op=OPEN (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x7f2d88cef2b0>: Failed to establish a new connection: [Errno 113] No route to host',))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "hdfs_read_local.py", line 3, in <module>
    with client.read('/home/edhuser/testdata.txt') as reader:
  File "/usr/lib/python3.6/contextlib.py", line 81, in __enter__
    return next(self.gen)
  File "/home/embs/.local/lib/python3.6/site-packages/hdfs/client.py", line 678, in read
    buffersize=buffer_size,
  File "/home/embs/.local/lib/python3.6/site-packages/hdfs/client.py", line 118, in api_handler
    raise err
  File "/home/embs/.local/lib/python3.6/site-packages/hdfs/client.py", line 107, in api_handler
    **self.kwargs
  File "/home/embs/.local/lib/python3.6/site-packages/hdfs/client.py", line 207, in _request
    **kwargs
  File "/usr/lib/python3/dist-packages/requests/sessions.py", line 468, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/lib/python3/dist-packages/requests/sessions.py", line 576, in send
    r = adapter.send(request, **kwargs)
  File "/usr/lib/python3/dist-packages/requests/adapters.py", line 437, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='192.168.12.52', port=9000): Max retries exceeded with url: /webhdfs/v1/home/edhuser/testdata.txt?user.name=embs&offset=0&op=OPEN (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x7f2d88cef2b0>: Failed to establish a new connection: [Errno 113] No route to host',))
Error in sys.excepthook:
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/apport_python_hook.py", line 63, in apport_excepthook
    from apport.fileutils import likely_packaged, get_recent_crashes
  File "/usr/lib/python3/dist-packages/apport/__init__.py", line 5, in <module>
    from apport.report import Report
  File "/usr/lib/python3/dist-packages/apport/report.py", line 30, in <module>
    import apport.fileutils
  File "/usr/lib/python3/dist-packages/apport/fileutils.py", line 23, in <module>
    from apport.packaging_impl import impl as packaging
  File "/usr/lib/python3/dist-packages/apport/packaging_impl.py", line 23, in <module>
    import apt
  File "/usr/lib/python3/dist-packages/apt/__init__.py", line 23, in <module>
    import apt_pkg
ModuleNotFoundError: No module named 'apt_pkg'

Original exception was:
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/urllib3/connection.py", line 137, in _new_conn
    (self.host, self.port), self.timeout, **extra_kw)
  File "/usr/lib/python3/dist-packages/urllib3/util/connection.py", line 91, in create_connection
    raise err
  File "/usr/lib/python3/dist-packages/urllib3/util/connection.py", line 81, in create_connection
    sock.connect(sa)
OSError: [Errno 113] No route to host

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 560, in urlopen
    body=body, headers=headers)
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 354, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/usr/lib/python3.6/http/client.py", line 1239, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/lib/python3.6/http/client.py", line 1285, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.6/http/client.py", line 1234, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.6/http/client.py", line 1026, in _send_output
    self.send(msg)
  File "/usr/lib/python3.6/http/client.py", line 964, in send
    self.connect()
  File "/usr/lib/python3/dist-packages/urllib3/connection.py", line 162, in connect
    conn = self._new_conn()
  File "/usr/lib/python3/dist-packages/urllib3/connection.py", line 146, in _new_conn
    self, "Failed to establish a new connection: %s" % e)
requests.packages.urllib3.exceptions.NewConnectionError: <requests.packages.urllib3.connection.HTTPConnection object at 0x7f2d88cef2b0>: Failed to establish a new connection: [Errno 113] No route to host

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/requests/adapters.py", line 376, in send
    timeout=timeout
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 610, in urlopen
    _stacktrace=sys.exc_info()[2])
  File "/usr/lib/python3/dist-packages/urllib3/util/retry.py", line 273, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
requests.packages.urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='192.168.12.52', port=9000): Max retries exceeded with url: /webhdfs/v1/home/edhuser/testdata.txt?user.name=embs&offset=0&op=OPEN (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x7f2d88cef2b0>: Failed to establish a new connection: [Errno 113] No route to host',))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "hdfs_read.py", line 3, in <module>
    with client.read('/home/edhuser/testdata.txt') as reader:
  File "/usr/lib/python3.6/contextlib.py", line 81, in __enter__
    return next(self.gen)
  File "/home/embs/.local/lib/python3.6/site-packages/hdfs/client.py", line 678, in read
    buffersize=buffer_size,
  File "/home/embs/.local/lib/python3.6/site-packages/hdfs/client.py", line 118, in api_handler
    raise err
  File "/home/embs/.local/lib/python3.6/site-packages/hdfs/client.py", line 107, in api_handler
    **self.kwargs
  File "/home/embs/.local/lib/python3.6/site-packages/hdfs/client.py", line 207, in _request
    **kwargs
  File "/usr/lib/python3/dist-packages/requests/sessions.py", line 468, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/lib/python3/dist-packages/requests/sessions.py", line 576, in send
    r = adapter.send(request, **kwargs)
  File "/usr/lib/python3/dist-packages/requests/adapters.py", line 437, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='192.168.12.52', port=9000): Max retries exceeded with url: /webhdfs/v1/home/edhuser/testdata.txt?user.name=embs&offset=0&op=OPEN (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x7f2d88cef2b0>: Failed to establish a new connection: [Errno 113] No route to host',))

How can I fix this connection issue? Am I using the wrong port? I thought the port the namenode uses is the one specified in core-site.xml, which, as shown above, is 9000. In any case, I have tried all the default ports mentioned in the Hadoop installation docs for various purposes (50070, 8020, 8048), and I still get the same error. Instead of client = InsecureClient('http://192.168.12.52:9000'), should I be using client = InsecureClient('hdfs://192.168.12.52:9000'), or maybe client = InsecureClient('file:///192.168.12.52:9000'), or something like that? I have seen these elsewhere at various times.

I can access the HDFS in web by the way, as shown in the screenshot below:

[Screenshot: HDFS web UI]

Also, even if it connects successfully, I think I may not be giving the right file path (/opt/hadoop/LICENSE.txt). I used this path because it is what I see when I list the Hadoop installation directory, /opt/hadoop:

$ ls /opt/hadoop/
bin                 lib          read_from_hdfs.py  write_to_hdfs_2.py
connect_to_hdfs.py  libexec      README.txt         write_to_hdfs3.py
etc                 LICENSE.txt  sbin               write_to_hdfs.py
hdfs_read_write.py  logs         share
include             NOTICE.txt   test_storage

But I know that HDFS is separate; maybe these files show up because I previously copied the contents of my HDFS onto the local machine with hdfs dfs -get /test_storage/ ./. And when I look in the namenode's directory, it contains only illegible files:

$ ls /opt/volume/namenode/current/
edits_0000000000000000001-0000000000000000002
edits_0000000000000000003-0000000000000000010
edits_0000000000000000011-0000000000000000012
edits_0000000000000000013-0000000000000000015
edits_0000000000000000016-0000000000000000023
edits_0000000000000000024-0000000000000000025
edits_0000000000000000026-0000000000000000032
edits_0000000000000000033-0000000000000000033
edits_0000000000000000034-0000000000000000035
edits_0000000000000000036-0000000000000000037
edits_0000000000000000038-0000000000000000039
edits_0000000000000000040-0000000000000000041
edits_0000000000000000042-0000000000000000043
edits_0000000000000000044-0000000000000000045
edits_0000000000000000046-0000000000000000047
edits_0000000000000000048-0000000000000000049
edits_0000000000000000050-0000000000000000051
edits_0000000000000000052-0000000000000000053
edits_0000000000000000054-0000000000000000055
edits_0000000000000000056-0000000000000000057
edits_0000000000000000058-0000000000000000059
edits_0000000000000000060-0000000000000000061
edits_0000000000000000062-0000000000000000063
edits_0000000000000000064-0000000000000000065
edits_0000000000000000066-0000000000000000067
edits_0000000000000000068-0000000000000000070
edits_0000000000000000071-0000000000000000072
edits_0000000000000000073-0000000000000000074
edits_0000000000000000075-0000000000000000076
edits_0000000000000000077-0000000000000000078
edits_inprogress_0000000000000000079
fsimage_0000000000000000076
fsimage_0000000000000000076.md5
fsimage_0000000000000000078
fsimage_0000000000000000078.md5
seen_txid
VERSION

So, if I am specifying the file path to read wrongly, what is the correct file path to use?

EDIT: Upon changing the port to 50070 (i.e., client = InsecureClient('http://192.168.12.52:50070')), I get the following error:

$ python3 hdfs_read_local.py 
Traceback (most recent call last):
  File "hdfs_read.py", line 3, in <module>
    with client.read('/opt/hadoop/LICENSE.txt') as reader:
  File "/usr/lib/python3.6/contextlib.py", line 81, in __enter__
    return next(self.gen)
  File "/home/embs/.local/lib/python3.6/site-packages/hdfs/client.py", line 678, in read
    buffersize=buffer_size,
  File "/home/embs/.local/lib/python3.6/site-packages/hdfs/client.py", line 112, in api_handler
    raise err
  File "/home/embs/.local/lib/python3.6/site-packages/hdfs/client.py", line 107, in api_handler
    **self.kwargs
  File "/home/embs/.local/lib/python3.6/site-packages/hdfs/client.py", line 210, in _request
    _on_error(response)
  File "/home/embs/.local/lib/python3.6/site-packages/hdfs/client.py", line 50, in _on_error
    raise HdfsError(message, exception=exception)
hdfs.util.HdfsError: File /opt/hadoop/LICENSE.txt not found.

EDIT 2: Upon modifying the file path from /opt/hadoop/LICENSE.txt to /test_storage/LICENSE.txt, which seems to be the correct HDFS path, and running the Python script, I get the following error:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/urllib3/connection.py", line 137, in _new_conn
    (self.host, self.port), self.timeout, **extra_kw)
  File "/usr/lib/python3/dist-packages/urllib3/util/connection.py", line 91, in create_connection
    raise err
  File "/usr/lib/python3/dist-packages/urllib3/util/connection.py", line 81, in create_connection
    sock.connect(sa)
OSError: [Errno 113] No route to host

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 560, in urlopen
    body=body, headers=headers)
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 354, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/usr/lib/python3.6/http/client.py", line 1239, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/lib/python3.6/http/client.py", line 1285, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.6/http/client.py", line 1234, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.6/http/client.py", line 1026, in _send_output
    self.send(msg)
  File "/usr/lib/python3.6/http/client.py", line 964, in send
    self.connect()
  File "/usr/lib/python3/dist-packages/urllib3/connection.py", line 162, in connect
    conn = self._new_conn()
  File "/usr/lib/python3/dist-packages/urllib3/connection.py", line 146, in _new_conn
    self, "Failed to establish a new connection: %s" % e)
requests.packages.urllib3.exceptions.NewConnectionError: <requests.packages.urllib3.connection.HTTPConnection object at 0x7f2e87867400>: Failed to establish a new connection: [Errno 113] No route to host

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/requests/adapters.py", line 376, in send
    timeout=timeout
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 610, in urlopen
    _stacktrace=sys.exc_info()[2])
  File "/usr/lib/python3/dist-packages/urllib3/util/retry.py", line 273, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
requests.packages.urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='pr2.embs', port=50075): Max retries exceeded with url: /webhdfs/v1/test_storage/LICENSE.txt?op=OPEN&user.name=embs&namenoderpcaddress=192.168.12.52:9000&offset=0 (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x7f2e87867400>: Failed to establish a new connection: [Errno 113] No route to host',))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "hdfs_read_local.py", line 3, in <module>
    with client.read('/test_storage/LICENSE.txt') as reader:
  File "/usr/lib/python3.6/contextlib.py", line 81, in __enter__
    return next(self.gen)
  File "/home/embs/.local/lib/python3.6/site-packages/hdfs/client.py", line 678, in read
    buffersize=buffer_size,
  File "/home/embs/.local/lib/python3.6/site-packages/hdfs/client.py", line 118, in api_handler
    raise err
  File "/home/embs/.local/lib/python3.6/site-packages/hdfs/client.py", line 107, in api_handler
    **self.kwargs
  File "/home/embs/.local/lib/python3.6/site-packages/hdfs/client.py", line 207, in _request
    **kwargs
  File "/usr/lib/python3/dist-packages/requests/sessions.py", line 468, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/lib/python3/dist-packages/requests/sessions.py", line 597, in send
    history = [resp for resp in gen] if allow_redirects else []
  File "/usr/lib/python3/dist-packages/requests/sessions.py", line 597, in <listcomp>
    history = [resp for resp in gen] if allow_redirects else []
  File "/usr/lib/python3/dist-packages/requests/sessions.py", line 195, in resolve_redirects
    **adapter_kwargs
  File "/usr/lib/python3/dist-packages/requests/sessions.py", line 576, in send
    r = adapter.send(request, **kwargs)
  File "/usr/lib/python3/dist-packages/requests/adapters.py", line 437, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='pr2.embs', port=50075): Max retries exceeded with url: /webhdfs/v1/test_storage/LICENSE.txt?op=OPEN&user.name=embs&namenoderpcaddress=192.168.12.52:9000&offset=0 (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x7f2e87867400>: Failed to establish a new connection: [Errno 113] No route to host',))

Upvotes: 3

Views: 9764

Answers (4)

Sergii V.

Reputation: 311

Hi, I faced a similar issue, and it looks like the port is right. In my case I was able to get a list of directories but couldn't write any data. The problem was my VPN, which blocked some of the ports; reads and writes use different ones.
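A quick way to see which ports are reachable from the client machine is a plain TCP probe. This is a minimal sketch; the host and ports below (NameNode HTTP 50070, DataNode HTTP 50075) are taken from the question and should be adjusted to your cluster:

```python
import socket

def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Host and ports assumed from the question:
# 50070 = NameNode HTTP (WebHDFS), 50075 = DataNode HTTP (actual reads/writes)
for port in (50070, 50075):
    print(port, 'open' if port_open('192.168.12.52', port) else 'unreachable')
```

If 50070 is open but 50075 is unreachable, listing directories will work while reading file contents fails, which matches the behavior described above.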

Upvotes: 0

OneCricketeer

Reputation: 191743

http://192.168.12.52:9000

9000 is an RPC port. 50070 is the default HTTP WebHDFS port.

You might get No route to host if WebHDFS is disabled, or if the datanode is not exposing port 50075 (the datanode HTTP address) because it is down or you changed that property.

client.read('/opt/hadoop/LICENSE.txt')

You're running HDFS in pseudo-distributed mode, but you're trying to read a local file. /opt does not exist in HDFS by default, and you've only run a local ls. You should instead use hadoop fs -ls /opt to see which files actually exist at the path you're trying to open.
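If you'd rather check from the client machine than shell into the VM, you can list an HDFS directory with a raw WebHDFS call using only the standard library. A hedged sketch; the host, port, and user below are assumptions taken from the question's error logs:

```python
import json
from urllib.request import urlopen

def webhdfs_ls(host: str, port: int, path: str = '/', user: str = 'embs'):
    """List an HDFS directory via the WebHDFS REST API (LISTSTATUS)."""
    url = (f'http://{host}:{port}/webhdfs/v1{path}'
           f'?op=LISTSTATUS&user.name={user}')
    with urlopen(url, timeout=5) as resp:
        statuses = json.load(resp)['FileStatuses']['FileStatus']
    return [s['pathSuffix'] for s in statuses]

# Host, port, and user assumed from the question; adjust to your cluster:
# print(webhdfs_ls('192.168.12.52', 50070, '/'))
```

Listing `/` this way shows which top-level HDFS directories actually exist, so you can confirm whether a path like /test_storage is valid before trying to open files under it.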

But when I search for the files in the path of the namenode, it returns some illegible files:

Your files are not stored in the namenode... Their metadata is

Your files are stored in the datanode data directories, but as blocks, not as human-readable content

You can run this command to get a list of all the blocks and their locations

hdfs fsck /path/to/file.txt -files -blocks

Upvotes: 1

Marco99

Reputation: 1659

There could be a problem with the network configuration. Try this tweaked code for the time being:

from hdfs import InsecureClient
client = InsecureClient('http://0.0.0.0:50070')
with client.read('/test_storage/LICENSE.txt') as reader:
    features = reader.read()
    print(features)

Read about the IP address 0.0.0.0.

Upvotes: 0

PinoSan

Reputation: 1508

As stated in its documentation, this Python library uses WebHDFS. To test that both the host and the file path are correct, you can use the command curl -i 'http://192.168.12.52:50070/webhdfs/v1/<PATH>?op=LISTSTATUS'. This lists a directory in HDFS. Once that works, you can use the same configuration in Python:

from hdfs import InsecureClient
client = InsecureClient('http://192.168.12.52:50070')
with client.read('<hdfs_path>') as reader:
    features = reader.read()
    print(features)

Upvotes: 1
