zonked.zonda
zonked.zonda

Reputation: 347

Xpath normalize-space

I'm feeling dumb. Python & xpath newbie here. I'm trying to extract the complete text 'Open Box Price: $1079.99' using xpath from

<div class="prod-price">
<p class="opbox-price">
    <strong> Open Box Price:<br>$1079.99</strong>
    </p>
<p class="orig-price">
    Regular Price: <strong>$1499.98</strong>
    </p>
</div>

But I can't. text stops at <br>. Here's my code

doc = lxml.html.fromstring(r.content)
elements = doc.xpath(item_xpath)
print elements[1].find('div[3]/p[1]/text()[normalize-space()]')

Upvotes: 4

Views: 4176

Answers (2)

Dimitre Novatchev
Dimitre Novatchev

Reputation: 243579

Just use, assuming the initial context (current node) is the parent of div:

normalize-space(div/p[1]/strong)

XSLT - based verification:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output method="text"/>

 <xsl:template match="/*">
     "<xsl:value-of select="normalize-space(div/p[1]/strong)"/>"
 </xsl:template>
</xsl:stylesheet>

when this transformation is applied on the following XML document (the provided document corrected to be made well-formed and then enclosed in a top html element):

<html>
    <div class="prod-price">
        <p class="opbox-price">
          <strong> Open Box Price:<br />$1079.99</strong>
        </p>
        <p class="orig-price">
    Regular Price: 
            <strong>$1499.98</strong>
        </p>
    </div>
</html>

the XPath expression is evaluated off the top element (html) and the result of the evaluation is copied (enclosed in quotes) to the output:

"Open Box Price:$1079.99"

Upvotes: 1

Jon Clements
Jon Clements

Reputation: 142216

A basis for the XPath you want is using descendant-or-self - tweak the result how you want:

>>> doc.xpath('//p[1]/descendant-or-self::text()')
['\n    ', ' Open Box Price:', '$1079.99', '\n    ']
>>> doc.xpath('//p[2]/descendant-or-self::text()')
['\n    Regular Price: ', '$1499.98', '\n    ']

Or as you're doing with lxml.html, you could use text_content()

paras = doc.xpath('//p'): # or findall etc...
for para in paras:
    print para.text_content()

Upvotes: 4

Related Questions