Stephen Strosko

Reputation: 667

Extract all text from a certain XML tag using lxml

I am looking to use lxml and Python to grab all text from a tree that looks like this:

<Body>
  <X1>
    <Text>some text</Text>
    <Other>text I don't want</Other>
  </X1>
  <X2>
    <Text>some text</Text>
    <Other>text I don't want</Other>
  </X2>
</Body>

The challenge is that I want only the text inside the Text tags, not the text inside other tags such as Other. I need a way to iterate over all of the nodes within Body and then combine the text from those nodes. This line of code gets me very close to what I want, but it also picks up the text from the Other tags, so I need a way to filter for just the text I want.

text = "".join([x for x in root.find('.//Body').itertext()]).strip().replace('\n', '')

Again, using the above tree and code, the output is: "some text text I don't want some text text I don't want", whereas I need: "some text some text".

Thanks for all of the help in advance!

Upvotes: 2

Views: 1369

Answers (1)

tdelaney

Reputation: 77407

A simple XPath expression should do:

>>> text="""<Body>
...   <X1>
...     <Text>some text</Text>
...     <Other>text I don't want</Other>
...   </X1>
...   <X2>
...     <Text>some text</Text>
...     <Other>text I don't want</Other>
...   </X2>
... </Body>"""
>>> 
>>> import lxml.etree
>>> doc = lxml.etree.fromstring(text)
>>> ' '.join(e.text for e in doc.xpath('//Text'))
'some text some text'
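If Body is nested inside a larger document, as the `root.find('.//Body')` call in the question suggests, a relative path keeps the search inside Body, and skipping elements whose `.text` is None avoids a TypeError in the join. The following is only a sketch under assumptions: the `Root`, `Header`, and `X3` elements are made up for illustration, and the fallback import is there purely so the snippet also runs where lxml is not installed (it uses only the part of the API that lxml shares with the standard library's ElementTree).

```python
try:
    from lxml import etree
except ImportError:
    # Fallback for illustration only: fromstring/find/iterfind behave
    # the same way in the standard library for this simple case.
    import xml.etree.ElementTree as etree

# Hypothetical wrapper document: Root, Header and X3 are not in the question.
xml = """<Root>
  <Header><Text>header text</Text></Header>
  <Body>
    <X1>
      <Text>some text</Text>
      <Other>text I don't want</Other>
    </X1>
    <X2>
      <Text>some text</Text>
      <Other>text I don't want</Other>
    </X2>
    <X3><Text/></X3>
  </Body>
</Root>"""

root = etree.fromstring(xml)
body = root.find('.//Body')

# './/Text' is evaluated relative to body, so the Text under Header is ignored.
# Empty <Text/> elements have .text == None and are filtered out.
texts = [e.text for e in body.iterfind('.//Text') if e.text]
result = ' '.join(texts)
print(result)
```

With lxml specifically, `body.xpath('.//Text/text()')` should give the same list of strings in a single call, since an empty Text element contributes no text node.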

Upvotes: 3
