Reputation: 50
Hello,
I have XML files structured as follows, and I would like to retrieve text1, text2, text3 and text4.
<?xml version="1.0" encoding="UTF-8"?>
<TABLE>
<MAIN>
<CONTENU>
text1 <br/> text2 <br/> text3 <br/> text4
</CONTENU>
</MAIN>
</TABLE>
I've been stuck for days without finding a solution in the ElementTree documentation. I have the following code, but I only get the first text because of the <br/>. In addition, the number of <br/> elements varies from one file to another.
import xml.etree.ElementTree as ET
tree = ET.parse('file.xml')
root = tree.getroot()
for txt in root.iter('CONTENU'):
    # .text stops at the first child element, so only text1 is printed
    print(txt.text)
>>> text1
How can I do that? Thanks in advance :)
Upvotes: 2
Views: 745
Reputation: 2469
Another method, using the simplified_scrapy library:
from simplified_scrapy import SimplifiedDoc
html = '''
<?xml version="1.0" encoding="UTF-8"?>
<TABLE>
<MAIN>
<CONTENU>
text1 <br/> text2 <br/> text3 <br/> text4
</CONTENU>
</MAIN>
</TABLE>
'''
doc = SimplifiedDoc(html)
texts = doc.select('CONTENU').getText(separator="|").split('|')
print(texts)
Upvotes: 0
Reputation: 8981
Try using tail instead of text to get the content after a closing tag:
import xml.etree.ElementTree as ET
XML = """<?xml version="1.0" encoding="UTF-8"?>
<TABLE>
<MAIN>
<CONTENU>
text1 <br/> text2 <br/> text3 <br/> text4
</CONTENU>
</MAIN>
</TABLE>
"""
root = ET.fromstring(XML)
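for txt in root.iter('CONTENU'):
    # .text holds the text before the first child element
    print((txt.text or "").strip())
    # each <br/> child carries the text that follows it in .tail
    for c in txt:
        print((c.tail or "").strip())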
Output:
text1
text2
text3
text4
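If a list is more convenient than printed lines, the same text/tail walk can be done with ElementTree's itertext(), which yields an element's text followed by the text and tail of each descendant in document order. The sketch below is only one way to do it; the helper name text_fragments is made up for illustration, and it assumes that empty or whitespace-only fragments should be dropped.

```python
import xml.etree.ElementTree as ET

XML = """<?xml version="1.0" encoding="UTF-8"?>
<TABLE>
<MAIN>
<CONTENU>
text1 <br/> text2 <br/> text3 <br/> text4
</CONTENU>
</MAIN>
</TABLE>
"""


def text_fragments(elem):
    """Return the stripped, non-empty text pieces inside elem.

    Hypothetical helper: itertext() yields elem.text and then the
    text/tail of every descendant, so the number of <br/> elements
    does not matter.
    """
    return [t.strip() for t in elem.itertext() if t.strip()]


root = ET.fromstring(XML)
for contenu in root.iter('CONTENU'):
    print(text_fragments(contenu))  # ['text1', 'text2', 'text3', 'text4']
```

Because the fragments are collected per CONTENU element, this works unchanged when a file has more or fewer <br/> separators.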
Upvotes: 2