NVaughan

Reputation: 1665

Parsing <p> tags using LXML in Python

I am parsing a TEI XML file using LXML in Python 3.5. For some reason I don't understand, the parser breaks the <p> tag contents wherever there are nested tags.

This is my code:

from lxml import etree
namespaces = {'tei':'http://www.tei-c.org/ns/1.0'}
xp_p = "//tei:body//tei:p//text()"
tree = etree.parse("data/sorb.xml")
paragraphs = tree.xpath(xp_p, namespaces=namespaces)
for par in paragraphs:
    print(par)

So, for example, if I have a <p> in the XML file like so:

<p xml:id="b1d3qun-cdtvet">
  <lb ed="#S"/>Circa distinctionem 3m quaero utrum mens humana
  <lb ed="#S"/>sit <choice><orig>ymago</orig><reg>imago</reg></choice> trinitatis increatae <choice><orig>sicud</orig><reg>sicut</reg></choice> in rebus a<lb ed="#S"/>liis factis propter hominem est vestigium eiusdem tri<lb ed="#S"/>nitatis
</p>

my script breaks its contents thus:

Circa distinctionem 3m quaero utrum mens humana

sit 
ymago
imago
 trinitatis increatae 
sicud
sicut
 in rebus a
liis
                factis propter hominem est vestigium eiusdem tri
nitatis

Whereas I'm seeking to get the whole <p> thus:

Circa distinctionem 3m quaero utrum mens humana sit ymago imago trinitatis increatae sicud sicut in rebus a liis factis propter hominem est vestigium eiusdem tri nitatis

Part 1 of my question is, What's going on, and how can I solve my problem?

Part 2 of my question would be, how can I get this other result?

Circa distinctionem 3m quaero utrum mens humana sit ymagoimago trinitatis increatae sicudsicut in rebus aliis factis propter hominem est vestigium eiusdem trinitatis

(i.e., the whole content of <p>)?

Upvotes: 1

Views: 297

Answers (1)

larsks

Reputation: 311721

"Part 1 of my question is, What's going on, and how can I solve my problem?"

Your XPath expression is explicitly requesting text nodes:

 //tei:body//tei:p//text()

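So what you get back is a flat list of every text node inside the <p> elements. Each nested tag (<lb/>, <choice>, <orig>, <reg>) ends one text node and starts another, and your loop prints each node on its own line, which is why the paragraph appears broken up.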
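To see why the text arrives in pieces, it helps to know that lxml stores the text before an element's first child in .text and the text following each child in that child's .tail; //text() returns each of these runs as a separate string. A minimal sketch, using a shortened inline sample (an assumption, standing in for data/sorb.xml):

```python
from lxml import etree

# Shortened, hypothetical sample; the real data lives in data/sorb.xml.
sample = (
    '<p xmlns="http://www.tei-c.org/ns/1.0">'
    'sit <choice><orig>ymago</orig><reg>imago</reg></choice> trinitatis'
    '</p>'
)
p = etree.fromstring(sample)

# Every run of character data between tags is its own text node.
nodes = p.xpath('.//text()')
print(nodes)

# The same pieces are reachable through .text and .tail.
choice = p[0]
print(repr(p.text), repr(choice.tail))
```

Here the single <choice> element is enough to split the paragraph into four text nodes.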

"Part 2 of my question would be, how can I get this other result?"

You're probably going to want to iterate over the <p> elements themselves, rather than the text nodes:

xp_p = "//tei:body//tei:p"

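Then within your loop, use the XPath string() function, which returns the concatenated text of the element and all of its descendants:

paragraphs = tree.xpath(xp_p, namespaces=namespaces)
for par in paragraphs:
    text = par.xpath('string(.)')
    print(repr(text))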

Which would give you:

'\n  Circa distinctionem 3m quaero utrum mens humana\n  sit ymagoimago trinitatis increatae sicudsicut in rebus aliis factis propter hominem est vestigium eiusdem trinitatis\n'
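The leading and trailing newlines and the internal line break can be collapsed as well. One option, sketched here on an inline copy of the sample paragraph (an assumption standing in for data/sorb.xml), is XPath's normalize-space() function, which trims the string value and replaces each run of whitespace with a single space:

```python
from lxml import etree

namespaces = {'tei': 'http://www.tei-c.org/ns/1.0'}

# Inline stand-in for data/sorb.xml, reduced to the paragraph from the question.
sample = '''<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
<p xml:id="b1d3qun-cdtvet">
  <lb ed="#S"/>Circa distinctionem 3m quaero utrum mens humana
  <lb ed="#S"/>sit <choice><orig>ymago</orig><reg>imago</reg></choice> trinitatis increatae <choice><orig>sicud</orig><reg>sicut</reg></choice> in rebus a<lb ed="#S"/>liis factis propter hominem est vestigium eiusdem tri<lb ed="#S"/>nitatis
</p>
</body></text></TEI>'''

tree = etree.fromstring(sample)
for par in tree.xpath('//tei:body//tei:p', namespaces=namespaces):
    # normalize-space(.) takes the string value of <p> and collapses whitespace.
    text = par.xpath('normalize-space(.)')
    print(text)
```

On this sample the loop prints the single-line result asked for in Part 2. For ordinary spaces, tabs and newlines, ' '.join(par.xpath('string(.)').split()) behaves the same way in pure Python.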

You could get to a similar result like this:

text = ' '.join(x.strip() for x in par.xpath('.//text()'))

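This strips the whitespace around every text node and joins the nodes with single spaces, so the newlines disappear, but a space is also inserted wherever a tag interrupts a word ("a liis", "tri nitatis"). You would end up with: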

' Circa distinctionem 3m quaero utrum mens humana sit ymago imago trinitatis increatae sicud sicut in rebus a liis factis propter hominem est vestigium eiusdem tri nitatis'
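The leading space in that result comes from the whitespace-only text node at the start of <p>, which strips to an empty string but still takes part in the join. A small variation, again on a shortened inline stand-in for the real file, filters out the empty pieces first:

```python
from lxml import etree

# Shortened, hypothetical stand-in for one paragraph of data/sorb.xml.
sample = '''<p xmlns="http://www.tei-c.org/ns/1.0">
  <lb ed="#S"/>sit <choice><orig>ymago</orig><reg>imago</reg></choice> trinitatis in rebus a<lb ed="#S"/>liis
</p>'''
par = etree.fromstring(sample)

# Strip every text node, then keep only the non-empty ones before joining.
pieces = (x.strip() for x in par.xpath('.//text()'))
text = ' '.join(piece for piece in pieces if piece)
print(text)
```

As with the one-liner, a space still lands between every pair of text nodes, so a word split by <lb/> comes out as "a liis".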

If rather than the text you actually want the entire XML content contained in the <p> element, see this answer. In Python 3, etree.tostring() returns bytes unless you pass encoding='unicode', and you need to iterate over the direct children only, because each serialized child already includes its own descendants (iterdescendants() would emit nested elements such as <orig> and <reg> twice). The solution would look something like this:

innerxml = (par.text or '') + ''.join(etree.tostring(child, encoding='unicode') for child in par)

And the result would look like:

'\n  <lb xmlns="http://www.tei-c.org/ns/1.0" ed="#S"/>Circa distinctionem 3m quaero utrum mens humana\n  <lb xmlns="http://www.tei-c.org/ns/1.0" ed="#S"/>sit <choice xmlns="http://www.tei-c.org/ns/1.0"><orig>ymago</orig><reg>imago</reg></choice> trinitatis increatae <choice xmlns="http://www.tei-c.org/ns/1.0"><orig>sicud</orig><reg>sicut</reg></choice> in rebus a<lb xmlns="http://www.tei-c.org/ns/1.0" ed="#S"/>liis factis propter hominem est vestigium eiusdem tri<lb xmlns="http://www.tei-c.org/ns/1.0" ed="#S"/>nitatis\n'

Upvotes: 4
