Reputation: 26288
I have an XML file in which mixed content like the following can occur:
...
<a><b>This is</b> some text about <c>some</c> issue I have, parsing xml</a>
...
Edit: Let's assume the tags can be nested more than one level deep, meaning
<a><b><c>...</c>...</b>...</a>
I came up with this using Python's lxml.etree library:
from lxml import etree

context = etree.iterparse(PATH_TO_XML, dtd_validation=True, events=("end",))
for event, element in context:
    tag = element.tag
    if tag == "a":
        print(element.text)  # is empty :/
        mystring = element.xpath("string()")
...
But somehow it goes wrong.
What I want is the whole string
"This is some text about some issue I have, parsing xml"
But I only get an empty string. Any suggestions? Thanks!
Upvotes: 0
Views: 1835
Reputation: 5272
This question has been asked many times.
You can use the text_content() method that lxml.html elements provide.
import lxml.html
t = lxml.html.fromstring("...")
t.text_content()
REF: Filter out HTML tags and resolve entities in python
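As a minimal sketch, here is text_content() applied to the fragment from the question. It assumes the fragment can be parsed leniently as HTML (the b and c tags are treated as ordinary elements); the variable names are only illustrative.

```python
import lxml.html

# The fragment from the question, parsed with the lenient HTML parser.
fragment = "<a><b>This is</b> some text about <c>some</c> issue I have, parsing xml</a>"
element = lxml.html.fromstring(fragment)

# text_content() joins the text of the element and all of its descendants,
# however deeply they are nested.
full_text = element.text_content()
print(full_text)
```

Because the descendants are walked recursively, the same call should also cover the nested case from the edit in the question.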
Or use the lxml.etree.strip_tags() function.
REF: In lxml, how do I remove a tag but retain all contents?
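A small sketch of the strip_tags() route, assuming the tag names to remove (b and c here, taken from the question's example) are known in advance. Note that it modifies the tree in place.

```python
from lxml import etree

root = etree.fromstring(
    "<a><b>This is</b> some text about <c>some</c> issue I have, parsing xml</a>"
)

# strip_tags() drops the named tags but merges their text and tail into the
# parent, so <a> is left with a single plain text node and no children.
etree.strip_tags(root, "b", "c")
print(root.text)
```

If the tree must stay intact, text_content() or root.xpath("string()") on the parsed element reads the same text without altering anything.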
Upvotes: 2