Pablo

Reputation: 1471

Extract XML text when child elements appear in between the text

I have this xml file:

<do title='Example document' date='today'>
<db descr='First level'>
    <P>
        Some text here that
        <af d='reference 1'>continues</af>
        but then has some more stuff.
    </P>
</db>
</do>

and I need to parse it to extract its text. I am using xml.etree.ElementTree for this (see documentation).

This is the simple code I use to parse and explore the file:

import xml.etree.ElementTree as ET
tree = ET.parse(file_path)  # file_path is the path to the XML file above
root = tree.getroot()

def explore_element(element):
    print(element.tag)
    print(element.attrib)
    print(element.text)
    for child in element:
        explore_element(child)

explore_element(root)

Things work as expected, except that element <P> does not have the complete text. In particular, I seem to be missing "but then has some more stuff" (the text in <P> that comes after the <af> element).

The XML files are given to me as they are, so I cannot change their structure, even if there is a better way to write them (and there are too many to fix manually).

Is there a way I can get all the text?

The output that my code produces (in case it helps) is this:

do
{'title': 'Example document', 'date': 'today'}

db
{'descr': 'First level'}

P 
{}
        Some text here that

af
{'d': 'reference 1'}
continues

EDIT:

The accepted answer made me realize I had not read the documentation as closely as I should. People with related problems may also find .tail useful.
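For readers with the same problem, here is a minimal sketch of how .tail could be folded into a recursive traversal like the one above. The helper name collect_text and the closing </do> tag in the sample string are assumptions added for illustration; they are not part of the original file or code.

```python
import xml.etree.ElementTree as ET

# Sample document from the question; the closing </do> is assumed so that it parses.
XML = """<do title='Example document' date='today'>
<db descr='First level'>
    <P>
        Some text here that
        <af d='reference 1'>continues</af>
        but then has some more stuff.
    </P>
</db>
</do>"""


def collect_text(element, parts=None):
    """Collect non-blank text in document order (hypothetical helper).

    element.text is the text before the first child;
    child.tail is the text that follows that child's closing tag.
    """
    if parts is None:
        parts = []
    if element.text and element.text.strip():
        parts.append(element.text.strip())
    for child in element:
        collect_text(child, parts)
        if child.tail and child.tail.strip():
            parts.append(child.tail.strip())
    return parts


root = ET.fromstring(XML)
p_text = " ".join(collect_text(root.find("./db/P")))
print(p_text)
```

The missing "but then has some more stuff." is stored as the .tail of the <af> element rather than as part of the .text of <P>, which is why printing only element.text skips it.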

Upvotes: 1

Views: 993

Answers (1)

DirtyBit

Reputation: 16772

Using BeautifulSoup:

list_test.xml:

<do title='Example document' date='today'>
<db descr='First level'>
    <P>
        Some text here that
        <af d='reference 1'>continues</af>
        but then has some more stuff.
    </P>
</db>
</do>

and then:

from bs4 import BeautifulSoup

with open('list_test.xml','r') as f:
    # html.parser lowercases tag names, so <P> is matched as 'p'
    soup = BeautifulSoup(f.read(), "html.parser")
    for line in soup.find_all('p'):
        print(line.text)

OUTPUT:

Some text here that
continues
but then has some more stuff.

EDIT:

Using ElementTree:

import xml.etree.ElementTree as ET
xml = '<p> Some text here that <af d="reference 1">continues</af> but then has some more stuff.</p>'
tree = ET.fromstring(xml)
print(''.join(tree.itertext()))

OUTPUT:

Some text here that continues but then has some more stuff.
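The same itertext() approach can probably be applied to the original multi-line file as well. The following is only a sketch: the function name paragraph_texts is hypothetical, the sample string assumes a closing </do> tag, and collapsing whitespace with split() is one possible choice for cleaning up the indentation and newlines.

```python
import xml.etree.ElementTree as ET

# Sample document from the question; the closing </do> is assumed so that it parses.
SAMPLE = """<do title='Example document' date='today'>
<db descr='First level'>
    <P>
        Some text here that
        <af d='reference 1'>continues</af>
        but then has some more stuff.
    </P>
</db>
</do>"""


def paragraph_texts(root):
    """Return the full text of every <P> element (hypothetical helper)."""
    texts = []
    for p in root.iter("P"):  # ElementTree tag names are case-sensitive
        raw = "".join(p.itertext())  # includes the text and tail of nested elements
        texts.append(" ".join(raw.split()))  # collapse newlines and indentation
    return texts


doc = ET.fromstring(SAMPLE)
for text in paragraph_texts(doc):
    print(text)
```

When reading from disk, ET.parse(file_path).getroot() would take the place of ET.fromstring(SAMPLE).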

Upvotes: 2
