Parsing <p> tags using LXML in Python

Question

I am parsing a TEI XML file using LXML in Python 3.5. For some reason I don't understand, the parser breaks the <p> tag contents wherever there are nested tags.

This is my code:

from lxml import etree
namespaces = {'tei':'http://www.tei-c.org/ns/1.0'}
xp_p = "//tei:body//tei:p//text()"
tree = etree.parse("data/sorb.xml")
paragraphs = tree.xpath(xp_p, namespaces=namespaces)
for par in paragraphs:
    print(par)

So, for example, if I have a <p> in the XML file like so:


  Circa distinctionem 3m quaero utrum mens humana
  sit ymagoimago trinitatis increatae sicudsicut in rebus aliis factis propter hominem est vestigium eiusdem trinitatis

my script breaks its contents thus:

Circa distinctionem 3m quaero utrum mens humana

sit 
ymago
imago
 trinitatis increatae 
sicud
sicut
 in rebus a
liis
                factis propter hominem est vestigium eiusdem tri
nitatis

Whereas I'm seeking to get the whole <p> thus:

Circa distinctionem 3m quaero utrum mens humana sit ymago imago trinitatis increatae sicud sicut in rebus a liis factis propter hominem est vestigium eiusdem tri nitatis

Part 1 of my question is, What's going on, and how can I solve my problem?

Part 2 of my question would be, how can I get this other result?

Circa distinctionem 3m quaero utrum mens humana sit ymagoimago trinitatis increatae sicudsicut in rebus aliis factis propter hominem est vestigium eiusdem trinitatis

(i.e., the whole content of <p>)?

larsks · Accepted Answer

Part 1 of my question is, What's going on, and how can I solve my problem?

Your xpath expression is explicitly requesting text nodes:

 //tei:body//tei:p//text()

So what you get back is a list of text nodes contained within the <p> element. Every nested tag starts a new text node, which is why the paragraph comes back in pieces.
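To see the splitting on a small scale, here is a minimal sketch that runs the same kind of XPath against an inline fragment instead of data/sorb.xml. The SAMPLE string and its <choice>/<orig>/<reg> markup are assumptions made for illustration; the markup in the actual file may differ.

```python
from lxml import etree

# Hypothetical TEI fragment (assumed markup, not taken from the original file).
SAMPLE = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>'
    '<p>sit <choice><orig>ymago</orig><reg>imago</reg></choice> trinitatis</p>'
    '</body></text></TEI>'
)

namespaces = {'tei': 'http://www.tei-c.org/ns/1.0'}
root = etree.fromstring(SAMPLE)

# //text() selects each text node separately, so the nested elements
# cut the paragraph into several strings.
text_nodes = root.xpath('//tei:body//tei:p//text()', namespaces=namespaces)
pieces = [str(node) for node in text_nodes]
print(pieces)
```

Each item in the list is one text node, so a loop that prints them one by one prints the paragraph in fragments.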

Part 2 of my question would be, how can I get this other result?

You're probably going to want to iterate over the <p> elements themselves, rather than the text nodes:

xp_p = "//tei:body//tei:p"

Then within your loop, use the xpath string function:

for par in paragraphs:
  text = par.xpath('string(.)')

Which would give you:

'
  Circa distinctionem 3m quaero utrum mens humana
  sit ymagoimago trinitatis increatae sicudsicut in rebus aliis factis propter hominem est vestigium eiusdem trinitatis
'
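Put together, the approach above might look like the following self-contained sketch. The inline SAMPLE fragment and its <choice>/<orig>/<reg> markup are assumptions for illustration only.

```python
from lxml import etree

# Hypothetical TEI fragment (assumed markup, not taken from the original file).
SAMPLE = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>'
    '<p>\n  mens humana\n  sit <choice><orig>ymago</orig>'
    '<reg>imago</reg></choice> trinitatis\n</p>'
    '</body></text></TEI>'
)

namespaces = {'tei': 'http://www.tei-c.org/ns/1.0'}
root = etree.fromstring(SAMPLE)

# Select the <p> elements, then ask XPath for the string value of each one.
paragraphs = root.xpath('//tei:body//tei:p', namespaces=namespaces)
texts = [par.xpath('string(.)') for par in paragraphs]

for text in texts:
    print(repr(text))
```

string(.) concatenates all descendant text nodes exactly as they are, so the original newlines and indentation are kept.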

You could get to a similar result like this:

text = ' '.join(x.strip() for x in par.xpath('.//text()'))

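...which would have the advantage of trimming the whitespace (newlines included) around each text node and joining the pieces with single spaces. Note that this also puts a space wherever a nested tag splits a word. You would end up with: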

' Circa distinctionem 3m quaero utrum mens humana sit ymago imago trinitatis increatae sicud sicut in rebus a liis factis propter hominem est vestigium eiusdem tri nitatis'
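If the goal is the single-line result from Part 2 of the question, without extra spaces where tags interrupt a word, one option is the XPath normalize-space() function, which collapses runs of whitespace in the string value of the element. This is a sketch under the same assumptions as before: the SAMPLE fragment and its markup are hypothetical.

```python
from lxml import etree

# Hypothetical TEI fragment (assumed markup, not taken from the original file).
SAMPLE = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>'
    '<p>\n  mens humana\n  sit <choice><orig>ymago</orig>'
    '<reg>imago</reg></choice> trinitatis\n</p>'
    '</body></text></TEI>'
)

namespaces = {'tei': 'http://www.tei-c.org/ns/1.0'}
root = etree.fromstring(SAMPLE)

# normalize-space(.) trims the ends and replaces each run of whitespace
# inside the paragraph with one space; it adds no space between text nodes.
paragraphs = root.xpath('//tei:body//tei:p', namespaces=namespaces)
normalized = [par.xpath('normalize-space(.)') for par in paragraphs]
print(normalized)
```

Because nothing is inserted between adjacent text nodes, words that are only interrupted by markup stay joined.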

If, rather than the text, you actually want the entire XML content contained in the <p> element, see this answer. The solution would look something like this:

# encoding='unicode' makes tostring() return str instead of bytes on Python 3
innerhtml = ''.join(etree.tostring(child, encoding='unicode') for child in par.iterdescendants())

And the result would look like:

'Circa distinctionem 3m quaero utrum mens humana
  sit ymagoimago trinitatis increatae ymagoimagosicudsicut in rebus asicudsicutliis factis propter hominem est vestigium eiusdem trinitatis
'
