Extract all text from a certain xml tag using lxml

Question

I am looking to use lxml and Python to grab all text from a tree that looks like this:


  
<Body>
  <Text>some text</Text>
  <Other>text I don't want</Other>
  <Text>some text</Text>
  <Other>text I don't want</Other>
</Body>

The challenge here is that I want only the text that exists in the Text tags, not the text that exists in other tags like Other. I need a way to iterate over all of the nodes within Body and then combine the text from those nodes. This line of code gets me very close to what I want, but it also picks up the text from the Other tags, so I need a way to filter out everything except the text I want.

text = "".join([x for x in root.find('.//Body').itertext()]).strip().replace('\n', '')

Again, using the above tree and code, the output is "some text text I don't want some text text I don't want", whereas I need "some text some text".

Thanks for all of the help in advance!

tdelaney · Accepted Answer

A simple XPath expression should do:

>>> text = """<Body>
...   <Text>some text</Text>
...   <Other>text I don't want</Other>
...   <Text>some text</Text>
...   <Other>text I don't want</Other>
... </Body>"""
>>> 
>>> import lxml.etree
>>> doc = lxml.etree.fromstring(text)
>>> ' '.join(e.text for e in doc.xpath('//Text'))
'some text some text'
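
One caveat with joining on e.text: it is None for an empty element, which makes str.join raise a TypeError, and it stops at the first child element, so text nested deeper inside a Text tag is missed. A slightly more defensive sketch is below. The helper name text_from_tag, its parameters, and the sample document (including the nested b tag and the empty Text tag) are made up for illustration and are not from the original question.

```python
import lxml.etree

# Hypothetical sample; the nested <b> and the empty <Text/> are added
# here only to exercise the edge cases described above.
SAMPLE = """
<Body>
  <Text>some text</Text>
  <Other>text I don't want</Other>
  <Text>some <b>nested</b> text</Text>
  <Text/>
  <Other>text I don't want</Other>
</Body>
"""


def text_from_tag(root, container="Body", tag="Text"):
    """Join the text of every <tag> element found under any <container>."""
    parts = []
    for elem in root.xpath(f"//{container}//{tag}"):
        # itertext() walks the element's own text plus the text and tails
        # of its descendants, and yields nothing for an empty element.
        content = "".join(elem.itertext()).strip()
        if content:
            parts.append(content)
    return " ".join(parts)


doc = lxml.etree.fromstring(SAMPLE)
print(text_from_tag(doc))
```

Scoping the expression to //Body//Text also keeps Text elements that live outside Body out of the result, which may or may not matter for your real document.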
