Santosh Ghimire

Reputation: 3145

Scrapy xpath get text of an element that starts with <

I am trying to get the text "<1 hour" from this HTML snippet:

<div class="details_wrapper">
<div class="detail">
    <b>Recommended length of visit:</b>
    <1 hour
</div>
<div class="detail">
    <b>Fee:</b>
    No
</div>
</div>

This is the xpath expression that I am using:

visit_length = response.xpath(
    "//div[@class='details_wrapper']/"
    "div[@class='detail']/b[contains(text(), "
    "'Recommended length of visit:')]/parent::div/text()"
).extract()

But it does not return the text. I think this is because of the "<" in the text I need: it is being treated as the start of an HTML tag. How can I scrape the text "<1 hour"?

Upvotes: 4

Views: 9157

Answers (2)

eLRuLL

Reputation: 18799

This is an lxml issue that has already been reported against Parsel, the parsing library Scrapy uses; see the issue reported there.

As that issue suggests, one solution is to pass the type='xml' argument to a selector. Your spider should look something like this:

from scrapy import Selector
...
...
    def your_parse_method(self, response):
        # response.text replaces the deprecated response.body_as_unicode()
        sel = Selector(text=response.text, type='xml')
        # now use "sel" instead of response for getting xpath info
        ...
        visit_length = sel.xpath("//div[@class='details_wrapper']/"
            "div[@class='detail']/b[contains(text(), "
            "'Recommended length of visit:')]/parent::div/text()").extract()

Upvotes: 1

har07

Reputation: 89325

Since Scrapy uses lxml under the hood, it is worth inspecting how lxml handles this kind of HTML, where one of the text nodes contains the XML special character <:

>>> from lxml import html
>>> raw = '''<div class="details_wrapper">
... <div class="detail">
...     <b>Recommended length of visit:</b>
...     <1 hour
... </div>
... <div class="detail">
...     <b>Fee:</b>
...     No
... </div>
... </div>'''
... 
>>> root = html.fromstring(raw)
>>> print html.tostring(root)
<div class="details_wrapper">
<div class="detail">
    <b>Recommended length of visit:</b>

<div class="detail">
    <b>Fee:</b>
    No
</div>
</div></div>

Notice in the demo above that, as you suspected, the text node '<1 hour' is gone completely from the serialized root element. As a workaround, consider using BeautifulSoup, since it handles this kind of HTML more sensibly. You can create the soup from a Scrapy response by passing it response.text (response.body_as_unicode() on older Scrapy versions):

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(raw, "html.parser")
>>> print soup.prettify()
<div class="details_wrapper">
 <div class="detail">
  <b>
   Recommended length of visit:
  </b>
  &lt;1 hour
 </div>
 <div class="detail">
  <b>
   Fee:
  </b>
  No
 </div>
</div>

Finding the target text node with BeautifulSoup can be done as follows:

>>> soup.find('b', text='Recommended length of visit:').next_sibling
u'\n    <1 hour\n'
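
If installing BeautifulSoup is not an option, roughly the same result should be achievable with the standard library alone. BeautifulSoup's "html.parser" backend is built on Python's own html.parser module, which treats a < that is not followed by a letter as plain text instead of a tag. The following is only a minimal Python 3 sketch under that assumption; DetailParser and RAW are hypothetical names introduced for illustration, and the logic assumes every detail block looks like the snippet in the question (a <b> label followed by its text inside a <div>).

```python
from html.parser import HTMLParser

# Same snippet as in the question.
RAW = '''<div class="details_wrapper">
<div class="detail">
    <b>Recommended length of visit:</b>
    <1 hour
</div>
<div class="detail">
    <b>Fee:</b>
    No
</div>
</div>'''


class DetailParser(HTMLParser):
    """Hypothetical helper: maps each <b> label to the text that follows it."""

    def __init__(self):
        super().__init__()
        self.in_label = False   # True while inside a <b> element
        self.label = None       # label of the detail block being read
        self.details = {}

    def handle_starttag(self, tag, attrs):
        if tag == "b":
            self.in_label = True
            self.label = ""

    def handle_endtag(self, tag):
        if tag == "b":
            self.in_label = False
        elif tag == "div":
            self.label = None   # the current detail block has ended

    def handle_data(self, data):
        if self.in_label:
            self.label += data
        elif self.label is not None:
            # The text may arrive in several chunks ("<" and "1 hour"),
            # so accumulate them under the current label.
            self.details[self.label] = self.details.get(self.label, "") + data


parser = DetailParser()
parser.feed(RAW)
parser.close()

# Strip the surrounding whitespace from labels and values.
details = {label.strip(): text.strip() for label, text in parser.details.items()}
print(details["Recommended length of visit:"])
```

Inside a spider, the same parser could be fed response.text instead of RAW. For anything beyond this simple label/value layout, BeautifulSoup is likely the more robust choice.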

Upvotes: 2
