Léo Guillaume

Reputation: 50

Retrieve text between multiple <br> in xml with python

Hello,

I have XML files structured as follows, and I would like to retrieve text1, text2, text3 and text4.

<?xml version="1.0" encoding="UTF-8"?>
<TABLE>
<MAIN>
<CONTENU>
text1 <br/> text2  <br/> text3  <br/> text4
</CONTENU>
</MAIN>
</TABLE>

I've been stuck for days without finding a solution in the ElementTree documentation. With the following code I only get the first text, because of the <br/> elements. In addition, the number of <br/> elements varies from one file to another.

import xml.etree.ElementTree as ET

tree = ET.parse('file.xml')
root = tree.getroot()

for txt in root.iter('CONTENU'):
   print(txt.text)

>>> text1

How can I do that? Thanks in advance :)

Upvotes: 2

Views: 745

Answers (2)

dabingsou

Reputation: 2469

Another method, using the third-party simplified_scrapy library:

from simplified_scrapy import SimplifiedDoc
html = '''
<?xml version="1.0" encoding="UTF-8"?>
<TABLE>
<MAIN>
<CONTENU>
text1 <br/> text2  <br/> text3  <br/> text4
</CONTENU>
</MAIN>
</TABLE>
'''
doc = SimplifiedDoc(html)
texts = doc.select('CONTENU').getText(separator="|").split('|')
print(texts)

Upvotes: 0

trsvchn

Reputation: 8981

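Use tail rather than text to get the content that follows a closing tag. In ElementTree, an element's text is only the content before its first child, and whatever comes after a child's closing tag is stored in that child's tail: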

import xml.etree.ElementTree as ET

XML = """<?xml version="1.0" encoding="UTF-8"?>
<TABLE>
<MAIN>
<CONTENU>
text1 <br/> text2  <br/> text3  <br/> text4
</CONTENU>
</MAIN>
</TABLE>
"""

root = ET.fromstring(XML)

for txt in root.iter('CONTENU'):
    print(txt.text)
    for c in txt.iter():
        print(c.tail)

Output:


text1 


 text2  
 text3  
 text4
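
The blank lines in the output above come from surrounding whitespace and from the tail of CONTENU itself, since iter() also yields the element it is called on. If a clean list is wanted, one possible variation is to use itertext(), which walks the text and tail values inside an element in document order, and strip each fragment. This is only a sketch: the helper name contenu_texts is made up for illustration, and it assumes whitespace-only fragments can be discarded.

```python
import xml.etree.ElementTree as ET

XML = """<?xml version="1.0" encoding="UTF-8"?>
<TABLE>
<MAIN>
<CONTENU>
text1 <br/> text2  <br/> text3  <br/> text4
</CONTENU>
</MAIN>
</TABLE>
"""


def contenu_texts(root):
    """Hypothetical helper: collect the non-empty text fragments of every CONTENU element."""
    texts = []
    for contenu in root.iter("CONTENU"):
        # itertext() yields the element's own text, then the text and tail
        # of each descendant, so it works for any number of <br/> elements.
        for fragment in contenu.itertext():
            fragment = fragment.strip()
            if fragment:  # assumption: whitespace-only fragments are not wanted
                texts.append(fragment)
    return texts


root = ET.fromstring(XML)
texts = contenu_texts(root)
print(texts)
```

For a file on disk, the same helper should work with root = ET.parse('file.xml').getroot().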

Upvotes: 2
