Scrapy xpath get text of an element that starts with <

Question

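I am trying to get the text "<1 hour" from this HTML snippet:

<div class="details_wrapper">
<div class="detail">
    <b>Recommended length of visit:</b>
    <1 hour
</div>
<div class="detail">
    <b>Fee:</b>
    No
</div>
</div>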

This is the xpath expression that I am using:

visit_length = response.xpath(
    "//div[@class='details_wrapper']/"
    "div[@class='detail']/b[contains(text(), "
    "'Recommended length of visit:')]/parent::div/text()"
).extract()

But it does not return the text. I think this is because of the "<" in the text that I need: it is being treated as the start of an HTML tag. How can I scrape the text "<1 hour"?

har07 · Accepted Answer

Considering that Scrapy uses lxml under the hood, it might be worth inspecting how lxml handles this kind of HTML, which contains the XML special character < in one of its text nodes:

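>>> from lxml import html
>>> raw = '''<div class="details_wrapper">
... <div class="detail">
...     <b>Recommended length of visit:</b>
...     <1 hour
... </div>
... <div class="detail">
...     <b>Fee:</b>
...     No
... </div>
... </div>'''
>>> root = html.fromstring(raw)
>>> print(html.tostring(root).decode())
<div class="details_wrapper">
<div class="detail">
    <b>Recommended length of visit:</b>
    </div>
<div class="detail">
    <b>Fee:</b>
    No
</div>
</div>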

Notice in the demo above that, as you suspected, the text node '<1 hour' is gone completely from the serialized root element. As a workaround, consider using BeautifulSoup, which handles this HTML more reasonably (you can pass response.text, or response.body_as_unicode() on older Scrapy versions, to create the soup from the Scrapy response):

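>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(raw, "html.parser")
>>> print(soup.prettify())
<div class="details_wrapper">
 <div class="detail">
  <b>
   Recommended length of visit:
  </b>
  &lt;1 hour
 </div>
 <div class="detail">
  <b>
   Fee:
  </b>
  No
 </div>
</div>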
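
One likely reason the "html.parser" backend keeps the text: that BeautifulSoup tree builder sits on top of Python's standard-library HTMLParser, which reports a "<" that does not begin a tag as ordinary character data. Here is a minimal sketch of that behaviour using only the standard library. The TextCollector class and the shortened RAW sample are illustrative names and data, not part of the original answer:

```python
from html.parser import HTMLParser

# Shortened version of the snippet from the question (an assumption made
# to keep the example small).
RAW = '''<div class="details_wrapper">
<div class="detail">
    <b>Recommended length of visit:</b>
    <1 hour
</div>
</div>'''


class TextCollector(HTMLParser):
    """Hypothetical helper: records start tags and character data."""

    def __init__(self):
        super().__init__()
        self.tags = []    # names of the start tags seen, in order
        self.chunks = []  # pieces of character data, in order

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        self.chunks.append(data)


collector = TextCollector()
collector.feed(RAW)
collector.close()

# The stray "<" arrives as character data, so it survives in the joined text
# and "1" is never treated as a tag name.
text = "".join(collector.chunks)
print(collector.tags)
print(repr(text))
```

The "<" and the "1 hour" that follows may arrive as separate data chunks, which is why the sketch joins them before inspecting the result.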

The target text node can then be found with BeautifulSoup as follows:

>>> soup.find('b', text='Recommended length of visit:').next_sibling
'\n    <1 hour\n'
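
If staying with Scrapy's own selectors is preferable, a different workaround (not the one given in this answer) is to escape stray "<" characters before the markup reaches the parser. The sketch below assumes that a "<" starts real markup only when it is followed by a letter, "/", "!" or "?". The names STRAY_LT and escape_stray_lt are hypothetical:

```python
import re

# Assumption: a "<" that is NOT followed by a letter, "/", "!" or "?"
# cannot start a tag, comment, or declaration, so it is treated as text.
STRAY_LT = re.compile(r"<(?![A-Za-z/!?])")


def escape_stray_lt(markup):
    """Return markup with every stray '<' replaced by the entity '&lt;'."""
    return STRAY_LT.sub("&lt;", markup)


raw = '''<div class="detail">
    <b>Recommended length of visit:</b>
    <1 hour
</div>'''

fixed = escape_stray_lt(raw)
print(fixed)
```

The escaped string could then be passed to scrapy.Selector(text=fixed), where the XPath from the question should find the text node, since the parser decodes &lt; back to "<". This blunt pattern would also rewrite a "<" inside script or style content, so it is only a reasonable fit for simple pages.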
