Reputation: 1410
I am trying to parse XML files using PySpark. My files are stored in HDFS. I have written the code below, but when I execute it, it fails to find the file. Please help. My code is below.
Code:
import xml.etree.ElementTree as ET
filenme = sc.wholeTextFiles("/user/root/CDs")
def add_hrk(file):
    tree = ET.parse(file)
    doc = tree.getroot()
filenme.map(lambda(filename, content): filename).foreach(add_hrk)
Error:
IOError: [Errno 2] No such file or directory: u'hdfs://xxxx/user/root/CDs/Parsed_CD.xml'
I am using wholeTextFiles because ET.parse needs the path of the file currently being processed. I have checked that the file exists in HDFS, but it still throws this error. Any suggestions are welcome, as I have not found a solution.
Upvotes: 0
Views: 825
Reputation: 13946
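ElementTree's parse function expects a path on the local filesystem (or an open file object), so it cannot open an hdfs:// URL. Since wholeTextFiles already gives you each file's content as a string, you should use fromstring instead, for example: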
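import xml.etree.ElementTree as ET

# wholeTextFiles yields one (path, content) pair per file
filenme = sc.wholeTextFiles("/user/root/CDs")

def add_hrk(content):
    # fromstring returns the root Element directly, so getroot() is not needed
    doc = ET.fromstring(content)

# pass only the content (second item of each pair); works on Python 2 and 3
filenme.map(lambda pair: pair[1]).foreach(add_hrk)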
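If you also need the parsed values back on the driver, here is a minimal sketch of a function that takes one (path, content) pair and returns something. The name parse_cd, the TITLE tag, and the sample XML are hypothetical, since the question does not show the file layout. The function is exercised on a local tuple so it runs without a cluster:

```python
import xml.etree.ElementTree as ET

def parse_cd(pair):
    """Parse one (path, content) pair as produced by wholeTextFiles."""
    path, content = pair
    # fromstring parses XML text and returns the root Element
    root = ET.fromstring(content)
    # Assumed schema: TITLE elements somewhere below the root
    titles = [t.text for t in root.iter("TITLE")]
    return path, titles

# Local stand-in for a single record of sc.wholeTextFiles("/user/root/CDs")
sample = (
    "hdfs://xxxx/user/root/CDs/Parsed_CD.xml",
    "<CATALOG><CD><TITLE>Example A</TITLE></CD>"
    "<CD><TITLE>Example B</TITLE></CD></CATALOG>",
)
result = parse_cd(sample)
print(result)
```

On the cluster this would be used as sc.wholeTextFiles("/user/root/CDs").map(parse_cd).collect(). Note that foreach discards return values, so map followed by an action such as collect is the way to get results out.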
Upvotes: 1