Pyspark: No such file or directory in hdfs

Question

I am trying to parse XML files using PySpark. The files are stored in HDFS. My code is below, but when I run it, it cannot find the file. Please help.

Code:

import xml.etree.ElementTree as ET
filenme = sc.wholeTextFiles("/user/root/CDs")
def add_hrk(file):
   tree = ET.parse(file)
   doc = tree.getroot()
filenme.map(lambda(filename, content): filename).foreach(add_hrk)

Error:

IOError: [Errno 2] No such file or directory: u'hdfs://xxxx/user/root/CDs/Parsed_CD.xml'

I am using wholeTextFiles because ET.parse needs the path of the file currently being processed. I have checked that the file exists in HDFS, but the error is still thrown, and I have not found a solution.

Mariusz · Accepted Answer

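The ElementTree library expects files to be available on the local filesystem: ET.parse opens the path it is given as a local file and does not understand hdfs:// URIs. Since wholeTextFiles already gives you each file's content as a string, you should use fromstring instead, for example: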

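import xml.etree.ElementTree as ET
filenme = sc.wholeTextFiles("/user/root/CDs")
def add_hrk(content):
   # fromstring returns the root Element directly, so no getroot() call is needed
   doc = ET.fromstring(content)

# wholeTextFiles yields (filename, content) pairs; pass only the content on
filenme.map(lambda pair: pair[1]).foreach(add_hrk)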
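
Because foreach discards whatever the function returns, it may help to return the parsed values instead. The following is only a minimal sketch: the question does not show the XML schema, so the CATALOG/CD/TITLE layout, the sample data, and the helper name extract_titles are all assumptions made for illustration. The parsing function is plain Python, so it can be checked locally before running it on the cluster.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; the real schema of the files in
# /user/root/CDs is not shown in the question.
SAMPLE_XML = """<CATALOG>
  <CD><TITLE>Empire Burlesque</TITLE><ARTIST>Bob Dylan</ARTIST></CD>
  <CD><TITLE>Hide your heart</TITLE><ARTIST>Bonnie Tyler</ARTIST></CD>
</CATALOG>"""


def extract_titles(pair):
    """Take a (path, content) pair and return a list of (path, title) tuples."""
    path, content = pair
    # Parse the XML text directly; no filesystem access is involved.
    root = ET.fromstring(content)
    return [(path, cd.findtext("TITLE")) for cd in root.iter("CD")]


# Local check without a cluster:
print(extract_titles(("hdfs://xxxx/user/root/CDs/Parsed_CD.xml", SAMPLE_XML)))

# On a cluster, assuming an existing SparkContext named sc:
# titles = sc.wholeTextFiles("/user/root/CDs").flatMap(extract_titles).collect()
```

Using flatMap followed by collect brings the extracted values back to the driver, whereas foreach only runs the function on the executors for its side effects.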
