animal

Reputation: 1004

How to read files in HDFS directory using python

I am trying to read the files inside a directory in HDFS using Python. I used the code below, but I am getting an error.

Code:

from subprocess import Popen, PIPE
cat = Popen(["hadoop", "fs", "-cat", "/user/cloudera/CCMD"], stdout=PIPE)

Error:

cat: `/user/cloudera/CCMD': Is a directory
Traceback (most recent call last):
  File "hrkpat.py", line 6, in <module>
    tree = ET.parse(cat.stdout)
  File "/usr/lib64/python2.6/xml/etree/ElementTree.py", line 862, in parse
    tree.parse(source, parser)
  File "/usr/lib64/python2.6/xml/etree/ElementTree.py", line 587, in parse
    self._root = parser.close()
  File "/usr/lib64/python2.6/xml/etree/ElementTree.py", line 1254, in close
    self._parser.Parse("", 1) # end of data
xml.parsers.expat.ExpatError: no element found: line 1, column 0

Update:

I have 10-15 XML files in my HDFS directory that I want to parse. I am able to parse the XML when only one file is present in the directory, but as soon as there are multiple files I can no longer parse them. For this use case I want to write Python code that parses one file from the directory and, once it is done, moves on to the next one.

Upvotes: 1

Views: 7391

Answers (2)

Ronak Patel

Reputation: 3849

You can use the wildcard character * to read all the files in the directory:

hadoop fs -cat /user/cloudera/CCMD/*

Or read just the XML files:

hadoop fs -cat /user/cloudera/CCMD/*.xml
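
A minimal sketch of the same call from Python, reusing the Popen pattern from the question (hadoop fs expands the glob itself, so no shell is needed). Note that concatenating several XML documents produces a stream with more than one root element, so the combined output is raw text rather than a single document you could hand to ET.parse:

from subprocess import Popen, PIPE

# Read every XML file under the directory as one concatenated stream.
cat = Popen(["hadoop", "fs", "-cat", "/user/cloudera/CCMD/*.xml"], stdout=PIPE)
for line in cat.stdout:
    print line.rstrip()  # process the raw content as needed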

Upvotes: 2

franklinsijo

Reputation: 18270

The exception is cat: `/user/cloudera/CCMD': Is a directory

You are trying to perform a file operation on a directory. Pass the path of a file to the command.

Use this command in your subprocess call instead:

hadoop fs -cat /user/cloudera/CCMD/filename
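
Since the question wants to parse the files one at a time, here is a minimal sketch of that loop, assuming the usual output format of hadoop fs -ls (everything except the hadoop commands and the directory path from the question is illustrative):

from subprocess import Popen, PIPE
import xml.etree.ElementTree as ET

# List the directory, then cat and parse each file individually.
ls = Popen(["hadoop", "fs", "-ls", "/user/cloudera/CCMD"], stdout=PIPE)
for line in ls.stdout:
    fields = line.split()
    # Skip the "Found N items" header line and any subdirectories.
    if len(fields) < 8 or fields[0].startswith("d"):
        continue
    path = fields[-1]
    cat = Popen(["hadoop", "fs", "-cat", path], stdout=PIPE)
    tree = ET.parse(cat.stdout)  # parse this file, then move to the next
    # ... work with tree.getroot() here ...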

Upvotes: 1
