Reputation: 13
I want to convert XML files into a CSV file. My XML files contain many different tags, and I select only the ones that are useful for my work. I want to extract only the text content inside the TEXT tags. My problem is that I don't know how to access the CDATA content. Because TEXT in some DOCs has an IMAGE child, my code just parses the IMAGE tag, and the value shows up as NaN when I read the CSV file with pandas. I searched for CDATA but couldn't find any way to tell the parser to skip the IMAGE tag and extract only the content of the CDATA section. I also tried deleting the IMAGE tags from TEXT, but that removed all of the TEXT content, including the CDATA section.
My XML pattern is as follows:
<root>
<DOC>
<TEXT>
<IMAGE>/1379/791012/p18-1.jpg</IMAGE>
<![CDATA[The section I want to access to]]>
</TEXT>
</DOC>
</root>
And here is my parsing code:
import os
import re
import xml.etree.ElementTree as ET

def make_csv(folderpath, xmlfilename, csvwriter, csv_file):
    # Parse XML file
    tree = ET.parse(os.path.join(folderpath, xmlfilename))
    root = tree.getroot()
    for elem in root.findall("DOC"):
        rows = []
        sentence = elem.find("TEXT")
        if sentence is not None:
            # .text only holds the text before the first child (here, IMAGE)
            sentence = re.sub('\n', '', sentence.text)
        rows.append(sentence)
        csvwriter.writerow(rows)
    csv_file.close()
I appreciate any help.
Upvotes: 1
Views: 906
Reputation: 23815
My problem is that I don't know how to access CDATA content. Because TEXT in some DOCs has an IMAGE child
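The code below seems to work. ElementTree does not expose CDATA as a separate node: it is parsed as ordinary text. When TEXT has an IMAGE child, the text that follows the closing IMAGE tag is stored in that child's tail attribute, not in the text attribute of TEXT. The code handles both cases: IMAGE under TEXT, and TEXT with no IMAGE under it.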
import xml.etree.ElementTree as ET
xml = '''<?xml version="1.0" encoding="UTF-8"?>
<root>
<DOC>
<TEXT>
<IMAGE>/1379/791012/p18-1.jpg</IMAGE>
<![CDATA[The section I want to access to]]>
</TEXT>
<TEXT>
<![CDATA[more text]]>
</TEXT>
</DOC></root>'''
root = ET.fromstring(xml)
texts = root.findall('.//TEXT')
for idx, text in enumerate(texts, start=1):
    data = list(text)[0].tail.strip() if list(text) else text.text.strip()
    print(f'{idx}) {data}')
output
1) The section I want to access to
2) more text
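If TEXT can contain more than one child, or the CDATA can appear before the IMAGE tag, a slightly more general approach is to join the element's own text with the tail of every child. The following is only a sketch under those assumptions: the helper name own_text, the in-memory buffer standing in for the CSV file, and the choice to collapse all whitespace runs into single spaces are mine, not part of the question.

```python
import csv
import io
import xml.etree.ElementTree as ET


def own_text(elem):
    """Return the text directly inside elem, skipping text inside its children."""
    parts = [elem.text or ""]
    # Text that follows a child element is stored in that child's .tail
    parts.extend(child.tail or "" for child in elem)
    # Collapse newlines and repeated spaces into single spaces (a design choice)
    return " ".join("".join(parts).split())


xml_doc = """<root>
<DOC>
<TEXT>
<IMAGE>/1379/791012/p18-1.jpg</IMAGE>
<![CDATA[The section I want to access to]]>
</TEXT>
</DOC>
<DOC>
<TEXT><![CDATA[more text]]></TEXT>
</DOC>
<DOC></DOC>
</root>"""

root = ET.fromstring(xml_doc)
buffer = io.StringIO()  # stands in for the real CSV file
writer = csv.writer(buffer)
for doc in root.findall("DOC"):
    text_elem = doc.find("TEXT")
    # A DOC without a TEXT element gets an empty cell instead of None
    writer.writerow([own_text(text_elem) if text_elem is not None else ""])

print(buffer.getvalue())
```

In the original make_csv function, the same helper could replace the sentence.text lookup, so that rows for DOCs with an IMAGE child are no longer empty.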
Upvotes: 1