Reputation: 1471
I have this xml file:
<do title='Example document' date='today'>
<db descr='First level'>
<P>
Some text here that
<af d='reference 1'>continues</af>
but then has some more stuff.
</P>
</db>
</do>
and I need to parse it to extract its text. I am using xml.etree.ElementTree
for this (see documentation).
This is the simple code I use to parse and explore the file:
import xml.etree.ElementTree as ET

tree = ET.parse(file_path)
root = tree.getroot()

def explore_element(element):
    print(element.tag)
    print(element.attrib)
    print(element.text)
    for child in element:
        explore_element(child)

explore_element(root)
Things work as expected, except that element <P> does not have the complete text. In particular, I seem to be missing "but then has some more stuff" (the text in <P> that comes after the <af> element).
The XML files are a given, so I cannot restructure them, even if there is a better way to write them (and there are too many files to fix by hand).
Is there a way I can get all the text?
The output that my code produces (in case it helps) is this:
do
{'title': 'Example document', 'date': 'today'}
db
{'descr': 'First level'}
P
{}
Some text here that
af
{'d': 'reference 1'}
continues
EDIT:
The accepted answer made me realize I had not read the documentation as closely as I should. People with related problems may also find .tail useful.
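For anyone landing here, this is a minimal sketch of how .tail could be folded into a recursive walk like the one above. It assumes the sample is inlined as a string with a closing </do> so that it parses; collect_text is a hypothetical helper name, not something from the library.

```python
import xml.etree.ElementTree as ET

# Sample from the question, inlined as a string; the closing </do> is assumed.
SAMPLE = """<do title='Example document' date='today'>
<db descr='First level'>
<P>
Some text here that
<af d='reference 1'>continues</af>
but then has some more stuff.
</P>
</db>
</do>"""


def collect_text(element):
    """Hypothetical helper: return the text pieces of element in document order."""
    pieces = []
    # .text is only the text before the first child element
    if element.text and element.text.strip():
        pieces.append(element.text.strip())
    for child in element:
        pieces.extend(collect_text(child))
        # .tail is the text that follows the child's closing tag
        if child.tail and child.tail.strip():
            pieces.append(child.tail.strip())
    return pieces


root = ET.fromstring(SAMPLE)
p_element = root.find("./db/P")
print(collect_text(p_element))
```

The text after </af> is stored on the <af> element as its .tail, not on <P>, which is why printing only element.text skips it.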
Upvotes: 1
Views: 993
Reputation: 16772
Using BeautifulSoup:
list_test.xml:
<do title='Example document' date='today'>
<db descr='First level'>
<P>
Some text here that
<af d='reference 1'>continues</af>
but then has some more stuff.
</P>
</db>
</do>
and then:
from bs4 import BeautifulSoup

with open('list_test.xml', 'r') as f:
    soup = BeautifulSoup(f.read(), "html.parser")

# html.parser lowercases tag names, so <P> is matched as 'p'
for line in soup.find_all('p'):
    print(line.text)
OUTPUT:
Some text here that
continues
but then has some more stuff.
EDIT:
Using ElementTree:
import xml.etree.ElementTree as ET
xml = '<p> Some text here that <af d="reference 1">continues</af> but then has some more stuff.</p>'
tree = ET.fromstring(xml)
print(''.join(tree.itertext()))
OUTPUT:
Some text here that continues but then has some more stuff.
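The same idea applied to the full sample from the question might look like the sketch below. It assumes the document is closed with </do>, and paragraph_texts is a hypothetical helper name. Note that ElementTree tag names are case-sensitive, so the tag has to be 'P' here, not 'p'.

```python
import xml.etree.ElementTree as ET

# Sample from the question, inlined as a string; the closing </do> is assumed.
SAMPLE = """<do title='Example document' date='today'>
<db descr='First level'>
<P>
Some text here that
<af d='reference 1'>continues</af>
but then has some more stuff.
</P>
</db>
</do>"""


def paragraph_texts(root, tag="P"):
    """Hypothetical helper: full text of every <tag> element, whitespace collapsed."""
    texts = []
    for element in root.iter(tag):
        # itertext() yields .text and .tail of all descendants in document order
        raw = "".join(element.itertext())
        # collapse newlines and repeated spaces into single spaces
        texts.append(" ".join(raw.split()))
    return texts


root = ET.fromstring(SAMPLE)
for text in paragraph_texts(root):
    print(text)
```

For a file on disk, ET.parse(file_path).getroot() can be used in place of ET.fromstring(SAMPLE).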
Upvotes: 2