MITHU

Reputation: 154

Can't parse certain information from some HTML elements using XPath

I've written an XPath expression to extract a piece of information from some HTML elements within Scrapy, but it doesn't reach the text I'm after.

HTML elements:

<div class="rates">
                <label>
                  Rates :
                </label>
                  R 3500
                  <br class="hidden-md hidden-lg">
              </div>

I wish to extract R 3500 out of it.

I've tried with:

from scrapy import Selector

html = """
<div class="rates">
                <label>
                  Rates :
                </label>
                  R 3500
                  <br class="hidden-md hidden-lg">
              </div>
"""
sel = Selector(text=html)
rate = sel.xpath("//*[@class='rates']/label/following::*").get()
print(rate)

Running the script above prints <br class="hidden-md hidden-lg">, whereas I want to get R 3500.

I could have used .tail if I had opted for lxml. With Scrapy, however, I can't find anything similar.

How can I extract that rate from the HTML elements using XPath?

Upvotes: 2

Views: 173

Answers (2)

Mathias Müller

Reputation: 22617

To complement the accepted answer, which is entirely correct, here is an explanation of why

//*[@class='rates']/label/following::*

given the document

<div class="rates">
   <label>
   Rates :
   </label>
   R 3500
   <br class="hidden-md hidden-lg">
</div>

does not return the text R 3500: the node test * selects only element nodes that follow the label element, not text nodes. Elements and text nodes are different kinds of node in the XPath data model. You can test this claim with a slightly different document:

<div class="rates">
   <label>
   Rates :
   </label>
   <any>R 3500</any>
   <br class="hidden-md hidden-lg">
</div>

With this document, your code returns the any element.

In the original document, both text() (more specific) and node() (more general) select the R 3500 text node, and in this case both the following:: and following-sibling:: axes work.

Upvotes: 1

RomanPerekhrest

Reputation: 92854

To get the text node that is a following sibling of the label node:

...
sel = Selector(text=html)
rate = sel.xpath("//*[@class='rates']/label/following-sibling::text()").get().strip()
print(rate)

The output:

R 3500

Addition: "//*[@class='rates']/label/following::text()" also works here, because the rate is the first text node after the label in document order.

https://www.w3.org/TR/1999/REC-xpath-19991116#axes

Upvotes: 3
