Reputation: 41
I have the following Python script (I managed to run it locally):
#!/usr/bin/env python3
import folderstats
df = folderstats.folderstats('hdfs://quickstart.cloudera:8020/user/cloudera/files', hash_name='md5', ignore_hidden=True)
df.to_csv(r'hdfs://quickstart.cloudera:8020/user/cloudera/files.csv', sep=',', index=True)
I have the directory "files" in that location. I checked this from the command line and also with Hue, and it's there.
(myproject) [cloudera@quickstart ~]$ hadoop fs -ls /user/cloudera
Found 1 items
drwxrwxrwx - cloudera cloudera 0 2019-06-01 13:30 /user/cloudera/files
The problem is that the script can't access the directory. I ran it from my local terminal with python3 script.py, and even as the HDFS superuser with sudo -u hdfs python3 script.py, but the output says:
Traceback (most recent call last):
  File "script.py", line 5, in <module>
    df = folderstats.folderstats('hdfs://quickstart.cloudera:8020/user/cloudera/files', hash_name='md5', ignore_hidden=True)
  File "/home/cloudera/miniconda3/envs/myproject/lib/python3.7/site-packages/folderstats/__init__.py", line 88, in folderstats
    verbose=verbose)
  File "/home/cloudera/miniconda3/envs/myproject/lib/python3.7/site-packages/folderstats/__init__.py", line 32, in _recursive_folderstats
    for f in os.listdir(folderpath):
FileNotFoundError: [Errno 2] No such file or directory: 'hdfs://quickstart.cloudera:8020/user/cloudera/files'
Can you please help me clarify this issue?
Thank you!
Upvotes: 3
Views: 1783
Reputation: 2767
Python runs on a single machine and reads from that machine's local Linux (or Windows) filesystem (FS).
Hadoop's HDFS is a distributed filesystem spread across many machines (nodes).
There may be some custom class out there that reads HDFS data from a single machine; however, I am not aware of one, and doing so defeats the purpose of distributed computing.
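That is exactly what your traceback shows: folderstats walks directories with plain os.listdir(), which only understands local paths, so the hdfs:// URL is treated as a literal (non-existent) local directory name:
import os

# os.listdir() knows nothing about HDFS; the URL is just a string path
# that does not exist on the local filesystem, hence the error.
os.listdir('hdfs://quickstart.cloudera:8020/user/cloudera/files')
# raises FileNotFoundError: [Errno 2] No such file or directory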
You could either copy your data from HDFS down to the local filesystem where Python lives (source HDFS location => target local FS location):
hadoop fs -get hdfs://quickstart.cloudera:8020/user/cloudera/files /home/user/<target_directory_name>
or use something like Spark, Hive, or Impala to process/query the data directly on the cluster.
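For example, a minimal sketch of the first (copy-then-process) option in Python, assuming the hadoop CLI is on the PATH; the local paths under /home/cloudera are illustrative:
#!/usr/bin/env python3
import subprocess
import folderstats

# Pull the directory down from HDFS onto the local filesystem.
subprocess.run(
    ['hadoop', 'fs', '-get',
     'hdfs://quickstart.cloudera:8020/user/cloudera/files',
     '/home/cloudera/files'],
    check=True)

# folderstats now sees an ordinary local directory.
df = folderstats.folderstats('/home/cloudera/files',
                             hash_name='md5', ignore_hidden=True)
df.to_csv('/home/cloudera/files.csv', sep=',', index=True)

# Optionally push the result back into HDFS (-f overwrites if it exists).
subprocess.run(
    ['hadoop', 'fs', '-put', '-f',
     '/home/cloudera/files.csv', '/user/cloudera/files.csv'],
    check=True)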
If the data volume is quite small, then copying the files from HDFS to the local FS and running the Python script there should be perfectly adequate on something like the Cloudera Quickstart VM.
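For larger volumes, a minimal PySpark sketch of the second option, assuming PySpark 2.x or later is available (the Spark bundled with older Quickstart images may predate SparkSession):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('hdfs-read-demo').getOrCreate()

# Spark's readers understand hdfs:// URLs natively, so no local copy is needed.
df = spark.read.text('hdfs://quickstart.cloudera:8020/user/cloudera/files')
print(df.count())  # total number of lines across all files in the directory
spark.stop()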
Upvotes: 2