TheRichUncle

Reputation: 41

Can't access directory from HDFS inside a Python script

I have the following Python script (I managed to run it locally):

#!/usr/bin/env python3

import folderstats

df = folderstats.folderstats('hdfs://quickstart.cloudera:8020/user/cloudera/files', hash_name='md5', ignore_hidden=True)

df.to_csv(r'hdfs://quickstart.cloudera:8020/user/cloudera/files.csv', sep=',', index=True)

The directory "files" exists at that location. I checked this through the command line and even with Hue, and it's there.

(myproject) [cloudera@quickstart ~]$ hadoop fs -ls /user/cloudera
Found 1 items
drwxrwxrwx   - cloudera cloudera          0 2019-06-01 13:30 /user/cloudera/files

The problem is that the directory can't be accessed.

I tried to run it in my local terminal with python3 script.py, and even as a superuser with sudo -u hdfs python3 script.py, and the output says:

Traceback (most recent call last):
  File "script.py", line 5, in <module>
    df = folderstats.folderstats('hdfs://quickstart.cloudera:8020/user/cloudera/files', hash_name='md5', ignore_hidden=True)
  File "/home/cloudera/miniconda3/envs/myproject/lib/python3.7/site-packages/folderstats/__init__.py", line 88, in folderstats
    verbose=verbose)
  File "/home/cloudera/miniconda3/envs/myproject/lib/python3.7/site-packages/folderstats/__init__.py", line 32, in _recursive_folderstats
    for f in os.listdir(folderpath):
FileNotFoundError: [Errno 2] No such file or directory: 'hdfs://quickstart.cloudera:8020/user/cloudera/files'

Can you please help me clarify this issue?

Thank you!

Upvotes: 3

Views: 1783

Answers (1)

thePurplePython

Reputation: 2767

Python runs on a single machine with a local Linux (or Windows) filesystem (FS).

Hadoop's HDFS project is a distributed file system set up across many machines (nodes).

There may be some custom class out there for reading HDFS data on a single machine, but I am not aware of one, and it would defeat the purpose of distributed computing anyway.

You can either copy your data from HDFS to the local filesystem where Python lives (source HDFS location => target local FS location) via hadoop fs -get hdfs://quickstart.cloudera:8020/user/cloudera/files /home/user/<target_directory_name>, or use something like Spark, Hive, or Impala to process/query the data.
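For the first option, here is a minimal sketch of the copy-then-process approach (assuming the hadoop CLI is on the PATH; the local staging path is illustrative):

#!/usr/bin/env python3

import subprocess
import folderstats

# folderstats walks the tree with os.listdir, which only understands the
# local filesystem, so copy the HDFS directory to a local staging path first.
local_dir = '/home/cloudera/files_local'  # illustrative target path
subprocess.run(
    ['hadoop', 'fs', '-get',
     'hdfs://quickstart.cloudera:8020/user/cloudera/files', local_dir],
    check=True)

# Now run folderstats against the local copy and write the CSV locally too.
df = folderstats.folderstats(local_dir, hash_name='md5', ignore_hidden=True)
df.to_csv('/home/cloudera/files.csv', sep=',', index=True)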

If the data volume is quite small, then copying the files from HDFS to the local FS and running the Python script there should be efficient enough for something like the Cloudera Quickstart VM.
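If you would rather keep the data on HDFS, here is a hedged PySpark sketch of the second option (assuming Spark is available, as it is on the Quickstart VM; the per-file size listing is illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('list-hdfs-files').getOrCreate()

# wholeTextFiles reads each file in the HDFS directory as a (path, content)
# pair, so the data never has to leave the cluster until you collect it.
rdd = spark.sparkContext.wholeTextFiles(
    'hdfs://quickstart.cloudera:8020/user/cloudera/files')

# Bring per-file sizes back to the driver, similar in spirit to folderstats.
for path, size in rdd.mapValues(len).collect():
    print(path, size)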

Upvotes: 2
