Reputation: 347
I'm feeling dumb. Python & xpath newbie here. I'm trying to extract the complete text 'Open Box Price: $1079.99'
using xpath from
<div class="prod-price">
<p class="opbox-price">
<strong> Open Box Price:<br>$1079.99</strong>
</p>
<p class="orig-price">
Regular Price: <strong>$1499.98</strong>
</p>
</div>
But I can't. The text stops at <br>. Here's my code:
doc = lxml.html.fromstring(r.content)
elements = doc.xpath(item_xpath)
print elements[1].find('div[3]/p[1]/text()[normalize-space()]')
Upvotes: 4
Views: 4176
Reputation: 243579
Assuming the initial context (current node) is the parent of the div, just use:
normalize-space(div/p[1]/strong)
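For what it's worth, here is a minimal sketch of evaluating the same kind of expression from Python with lxml. The `html` sample string and the class-based path `//p[@class="opbox-price"]/strong` are my own choices for illustration (I don't know the surrounding page structure), so adjust the path to your real context node. Note that normalize-space() works on the string value of the strong element, so the text on both sides of the <br> is included, but no space is inserted where the <br> was.

```python
import lxml.html

# Sample markup copied from the question (assumed to be representative).
html = """
<div class="prod-price">
<p class="opbox-price">
<strong> Open Box Price:<br>$1079.99</strong>
</p>
<p class="orig-price">
Regular Price: <strong>$1499.98</strong>
</p>
</div>
"""

doc = lxml.html.fromstring(html)

# normalize-space() takes the string value of the first matched node,
# trims it, and collapses internal runs of whitespace.
price = doc.xpath('normalize-space(//p[@class="opbox-price"]/strong)')
print(price)  # Open Box Price:$1079.99
```

Unlike find(), which only accepts the limited ElementPath syntax, xpath() can evaluate full XPath 1.0 expressions, including function calls that return a string.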
XSLT-based verification:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>
<xsl:template match="/*">
"<xsl:value-of select="normalize-space(div/p[1]/strong)"/>"
</xsl:template>
</xsl:stylesheet>
When this transformation is applied to the following XML document (the provided document, corrected to be well-formed and then enclosed in a top html element):
<html>
<div class="prod-price">
<p class="opbox-price">
<strong> Open Box Price:<br />$1079.99</strong>
</p>
<p class="orig-price">
Regular Price:
<strong>$1499.98</strong>
</p>
</div>
</html>
the XPath expression is evaluated against the top element (html), and the result of the evaluation is copied (enclosed in quotes) to the output:
"Open Box Price:$1079.99"
Upvotes: 1
Reputation: 142216
A basis for the XPath you want is descendant-or-self; tweak the result however you want:
>>> doc.xpath('//p[1]/descendant-or-self::text()')
['\n ', ' Open Box Price:', '$1079.99', '\n ']
>>> doc.xpath('//p[2]/descendant-or-self::text()')
['\n Regular Price: ', '$1499.98', '\n ']
Or, as you're using lxml.html, you could use text_content():
paras = doc.xpath('//p')  # or findall etc...
for para in paras:
    print(para.text_content())
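As one way of "tweaking the result", here is a rough sketch that joins the stripped text nodes with a single space, which gives the 'Open Box Price: $1079.99' form asked for in the question. The helper name `joined_text` and the `html` sample string are made up for this example; text_content() is shown alongside for comparison.

```python
import lxml.html

# Sample markup copied from the question (assumed to be representative).
html = """
<div class="prod-price">
<p class="opbox-price">
<strong> Open Box Price:<br>$1079.99</strong>
</p>
<p class="orig-price">
Regular Price: <strong>$1499.98</strong>
</p>
</div>
"""

doc = lxml.html.fromstring(html)
paras = doc.xpath('//p')


def joined_text(element):
    """Join all non-empty descendant text nodes with a single space."""
    pieces = (t.strip() for t in element.xpath('.//text()'))
    return ' '.join(p for p in pieces if p)


for para in paras:
    # text_content() concatenates the text as-is, so nothing separates
    # the pieces on either side of the <br>.
    print(repr(para.text_content().strip()))
    print(repr(joined_text(para)))
```

For the first paragraph, text_content().strip() gives 'Open Box Price:$1079.99', while joined_text() gives 'Open Box Price: $1079.99'.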
Upvotes: 4