Reputation: 720
I have a page:
<body>
<div>
<a id="123">text_url</a>
</div>
<body>
I want to get the element matched by '//div/a' as plain HTML text:
<a id="123">text_url</a>
How can I do it?
Upvotes: 2
Views: 7185
Reputation: 168646
If you have already parsed the object using lxml, you can serialize it with lxml.etree.tostring():
from lxml import etree
xml='''<body>
<div>
<a id="123">text_url</a>
</div>
</body>'''
root = etree.fromstring(xml)
for a in root.xpath('//div/a'):
    # encoding='unicode' makes tostring() return a str instead of bytes
    print(etree.tostring(a, method='html', with_tail=False, encoding='unicode'))
Upvotes: 2
Reputation: 521
You could use the xml library in Python.
from xml.etree.ElementTree import parse
doc = parse('page.xml') # assuming page.xml is on disk
print(doc.find('div/a[@id="123"]').text)
Note that this only works for strict XML. For example, your closing body tag is incorrect, so this code would fail on that page. HTML on the web is rarely strict XML.
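If the page is well-formed XML and the element's markup is needed rather than only its text, the standard library can serialize the element as well. A minimal sketch, assuming the page from the question with the closing body tag corrected is held in a string; the names page and element_markup are illustrative and not part of the answer above:

```python
import copy
import xml.etree.ElementTree as ET

# Assumed input: the question's page with the closing </body> tag corrected.
page = """<body>
<div>
<a id="123">text_url</a>
</div>
</body>"""


def element_markup(elem):
    """Serialize one element as markup, without the text that follows it."""
    clone = copy.copy(elem)  # shallow copy, so the parsed tree is left untouched
    clone.tail = None        # drop the newline that follows </a> in the source
    return ET.tostring(clone, encoding="unicode", method="html")


root = ET.fromstring(page)
link = root.find('div/a[@id="123"]')
print(element_markup(link))
```

Clearing the tail on a copy plays the same role as with_tail=False in lxml, which ElementTree's tostring() does not offer.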
Upvotes: 0
Reputation: 67968
You can use Python's re module with re.findall.
import re
print(re.findall(r".*?(<a.*?<\/a>).*", x, re.DOTALL))
where x is a string holding the page HTML from the question.
Output: ['<a id="123">text_url</a>']
See demo as well.
http://regex101.com/r/lF4lY6/1
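Because re.findall scans the whole string on its own, the surrounding .*? and .* are not strictly required. A slightly tighter sketch, assuming x holds the page exactly as posted in the question; the name anchor_re is illustrative. Like any regex approach, it can break on nested or unusual markup:

```python
import re

# Assumed input: the page exactly as posted in the question.
x = """<body>
<div>
<a id="123">text_url</a>
</div>
<body>"""

# <a\b matches the <a> tag only, not tags such as <abbr> or <article>.
# re.DOTALL lets .*? span line breaks inside the element.
anchor_re = re.compile(r"<a\b[^>]*>.*?</a>", re.DOTALL)

matches = anchor_re.findall(x)
print(matches)
```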
Upvotes: 0
Reputation: 720
A working solution in Python using the Grab module:
from grab import Grab
g = Grab()
g.go('file://page.htm')
print(g.doc.select('//div/a')[0].html())
>><a id="123">text_url</a>
Upvotes: 0