Jason Wirth

Reputation: 763

How to get text for a root element using lxml?

I'm completely stumped as to why lxml's .text gives me the text for a child tag but not for the root tag.

some_tag = etree.fromstring('<some_tag class="abc"><strong>Hello</strong> World</some_tag>')

some_tag.find("strong")
Out[195]: <Element strong at 0x7427d00>

some_tag.find("strong").text
Out[196]: 'Hello'

some_tag
Out[197]: <Element some_tag at 0x7bee508>

some_tag.text

some_tag.find("strong").text returns the text inside the <strong> tag.

I expect some_tag.text to return everything between <some_tag> and </some_tag>.

Expected:

<strong>Hello</strong> World

Instead, it returns nothing.

Upvotes: 8

Views: 10682

Answers (5)

sunny singh

Reputation: 33

You can use lxml's built-in XPath support to retrieve all the text inside the tag.

  from lxml import etree
  xml='''<some_tag class="abc"><strong>Hello</strong> World</some_tag>'''
  tree = etree.fromstring(xml)
  print(''.join(tree.xpath('//text()')))
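Note that '//text()' is an absolute path, so it collects text from the whole document rather than from one element. As a possible alternative, here is a minimal sketch using itertext(), which is scoped to the element it is called on. The fallback import is my own assumption, added only so the snippet also runs where lxml is not installed; the standard library's ElementTree uses the same .text/.tail model.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: fall back to the stdlib, which shares the .text/.tail model
    import xml.etree.ElementTree as etree

tree = etree.fromstring('<some_tag class="abc"><strong>Hello</strong> World</some_tag>')

# itertext() walks this element and its descendants in document order,
# yielding every .text and .tail string it finds along the way
text_only = "".join(tree.itertext())
print(text_only)
```

This gives the text without any markup, so the <strong> tags themselves are not part of the result.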

Upvotes: 0

mzjn

Reputation: 51042

from lxml import etree

XML = '<some_tag class="abc"><strong>Hello</strong> World</some_tag>'

some_tag = etree.fromstring(XML)

for element in some_tag:
    print(element.tag, element.text, element.tail)

Output:

strong Hello  World

For information on the .text and .tail properties, see:

To get exactly the result that you expected, use

print(etree.tostring(some_tag.find("strong"), encoding="unicode"))

Output:

<strong>Hello</strong> World

Upvotes: 10

daedalus

Reputation: 10923

Does this help?

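# Serialize each child of the root (each serialization includes the child's tail text)
comp = [etree.tostring(e, encoding="unicode") for e in some_tag]
print(''.join(comp))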

EDITED: Thanks @mzjn for putting me on the right track

Upvotes: 0

Matthias

Reputation: 13232

You'll find the missing text here:

>>> some_tag.find("strong").tail
' World'

Look at http://lxml.de/tutorial.html and search for "tail".
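To make the split between .text and .tail concrete, here is a small sketch using the XML from the question. The variable names are mine, and the fallback import is an assumption so it runs even without lxml installed.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: fall back to the stdlib, which shares the .text/.tail model
    import xml.etree.ElementTree as etree

root = etree.fromstring('<some_tag class="abc"><strong>Hello</strong> World</some_tag>')
strong = root.find("strong")

# .text is the text between an element's start tag and its first child
root_text = root.text        # None: nothing sits between <some_tag> and <strong>
strong_text = strong.text    # 'Hello'

# .tail is the text after an element's end tag, up to the next tag
strong_tail = strong.tail    # ' World'

print(repr(root_text), repr(strong_text), repr(strong_tail))
```

So ' World' belongs to the <strong> element as its tail, which is why some_tag.text comes back empty.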

Upvotes: 1

Thomas Leduc

Reputation: 1100

I'm not sure I understand your question, but there are two main approaches to parsing:

DOM parser: depending on the language, the text is available through something like node.getNodeValue().

SAX parser: also language-dependent; in Java, for example, the text arrives in the characters(...) callback.

I don't have time to search right now, but in Python I know of minidom (a DOM parser): http://www.blog.pythonlibrary.org/2010/11/12/python-parsing-xml-with-minidom/
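For what it's worth, a rough sketch of the DOM approach with the standard library's xml.dom.minidom might look like the following. It reuses the XML from the question; the variable names are only illustrative.

```python
from xml.dom import minidom

doc = minidom.parseString('<some_tag class="abc"><strong>Hello</strong> World</some_tag>')
root = doc.documentElement

# In the DOM, text lives in separate text nodes that are children of the element
direct_text = "".join(
    node.nodeValue
    for node in root.childNodes
    if node.nodeType == node.TEXT_NODE
)

# Serializing every child node rebuilds the markup between the root's tags
inner = "".join(node.toxml() for node in root.childNodes)

print(repr(direct_text))
print(inner)
```

Here the root's only direct text node is ' World', and joining the serialized children gives <strong>Hello</strong> World.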

I hope my answer can help you.

Upvotes: 0
