Reputation: 667
I want to use lxml and Python to grab all of the text from a tree that looks like this:
<Body>
<X1>
<Text>some text</Text>
<Other>text I don't want</Other>
</X1>
<X2>
<Text>some text</Text>
<Other>text I don't want</Other>
</X2>
</Body>
The challenge is that I want only the text inside the Text tags, not the text inside other tags such as Other. I need a way to iterate over all of the nodes within Body and combine the text from just those nodes. This line of code gets me very close to what I want, but it also picks up the text from the Other tags, so I need a way to filter for just the text I want.
text = "".join([x for x in root.find('.//Body').itertext()]).strip().replace('\n', '')
Using the above tree and code, the output is: "some text text I don't want some text text I don't want", whereas I need: "some text some text"
Thanks for all of the help in advance!
Upvotes: 2
Views: 1369
Reputation: 77407
A simple XPath expression should do:
>>> text="""<Body>
... <X1>
... <Text>some text</Text>
... <Other>text I don't want</Other>
... </X1>
... <X2>
... <Text>some text</Text>
... <Other>text I don't want</Other>
... </X2>
... </Body>"""
>>>
>>> import lxml.etree
>>> doc = lxml.etree.fromstring(text)
>>> ' '.join(e.text for e in doc.xpath('//Text'))
'some text some text'
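One caveat with the session above: `e.text` is `None` for an empty element such as `<Text/>`, and `' '.join` would then raise a `TypeError`. Here is a minimal sketch that guards against that, assuming the same sample document. It uses `iter('Text')` instead of XPath, which walks the tree in document order and yields only the matching elements. The names `xml`, `parts`, and `result` are mine, and the fallback import is only there so the snippet still runs with the standard library if lxml is not installed.

```python
try:
    import lxml.etree as etree
except ImportError:
    # Assumption: fall back to the stdlib parser, which also provides iter().
    import xml.etree.ElementTree as etree

xml = """<Body>
<X1>
<Text>some text</Text>
<Other>text I don't want</Other>
</X1>
<X2>
<Text>some text</Text>
<Other>text I don't want</Other>
</X2>
</Body>"""

doc = etree.fromstring(xml)

# Collect text from <Text> elements only; treat a missing text node as ''.
parts = [(e.text or "").strip() for e in doc.iter("Text")]

# Drop empty pieces so blank elements do not add stray spaces.
result = " ".join(p for p in parts if p)
print(result)
```

If you would rather stay with XPath in lxml, `doc.xpath('//Text/text()')` returns the text nodes directly and simply skips elements that have no text.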
Upvotes: 3