Reputation: 3145
I am trying to get the text "<1 hour" from this HTML snippet.
<div class="details_wrapper">
<div class="detail">
<b>Recommended length of visit:</b>
<1 hour
</div>
<div class="detail">
<b>Fee:</b>
No
</div>
</div>
This is the xpath expression that I am using:
visit_length = response.xpath(
    "//div[@class='details_wrapper']/"
    "div[@class='detail']/b[contains(text(), "
    "'Recommended length of visit:')]/parent::div/text()"
).extract()
But it is not able to get the text. I think this is because of the "<" in the text I need; it is being treated as the start of an HTML tag. How can I scrape the text "<1 hour"?
Upvotes: 4
Views: 9157
Reputation: 18799
This is an lxml issue, already reported against Parsel, the parser library Scrapy uses; check the issue there for details.
As suggested there, a workaround is to pass the type='xml' argument to a Selector; your spider should look something like this:
from scrapy import Selector
...
...
def your_parse_method(self, response):
    sel = Selector(text=response.body_as_unicode(), type='xml')
    # now use "sel" instead of response for getting xpath info
    ...
    visit_length = sel.xpath(
        "//div[@class='details_wrapper']/"
        "div[@class='detail']/b[contains(text(), "
        "'Recommended length of visit:')]/parent::div/text()"
    ).extract()
Upvotes: 1
Reputation: 89325
Considering that Scrapy uses lxml under the hood, it might be worth inspecting how lxml handles this kind of HTML, which contains the XML special character < in one of the text nodes:
>>> from lxml import html
>>> raw = '''<div class="details_wrapper">
... <div class="detail">
... <b>Recommended length of visit:</b>
... <1 hour
... </div>
... <div class="detail">
... <b>Fee:</b>
... No
... </div>
... </div>'''
...
>>> root = html.fromstring(raw)
>>> print html.tostring(root)
<div class="details_wrapper">
<div class="detail">
<b>Recommended length of visit:</b>
<div class="detail">
<b>Fee:</b>
No
</div>
</div></div>
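For comparison, if the < were properly escaped as &lt; in the source, lxml would keep the text node; a quick sketch on the same snippet with only the entity changed (the exact output may vary slightly by lxml version):

>>> escaped = raw.replace('<1 hour', '&lt;1 hour')
>>> root = html.fromstring(escaped)
>>> [t.strip() for t in root.xpath("//div[@class='detail']/text()") if t.strip()]
['<1 hour', 'No']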
Notice in the demo above that, as you suspected, the text node '<1 hour' is gone completely from the serialized root element. As a workaround, consider using BeautifulSoup, since it handles this HTML case more gracefully (you can pass response.body_as_unicode() to build the soup from the Scrapy response):
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(raw, "html.parser")
>>> print soup.prettify()
<div class="details_wrapper">
<div class="detail">
<b>
Recommended length of visit:
</b>
<1 hour
</div>
<div class="detail">
<b>
Fee:
</b>
No
</div>
</div>
Finding the target text node with BeautifulSoup can then be done as follows:
>>> soup.find('b', text='Recommended length of visit:').next_sibling
u'\n <1 hour\n'
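To wire this into a Scrapy callback, a minimal sketch (assuming the same markup and that beautifulsoup4 is installed) might look like:

from bs4 import BeautifulSoup

def parse(self, response):
    # build the soup from the raw response body
    soup = BeautifulSoup(response.body_as_unicode(), "html.parser")
    label = soup.find('b', text='Recommended length of visit:')
    if label and label.next_sibling:
        # strip the surrounding newlines/indentation -> u'<1 hour'
        yield {'visit_length': label.next_sibling.strip()}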
Upvotes: 2