Reputation: 447
I'm a newbie to XPath and Scrapy. I'm trying to target a node which does not have a unique class (i.e. class="pubBody").
Already tried: xpath not contains A and B
This should be a simple task, but XPath just misses the second item. I am doing this from the Scrapy shell. On the command prompt:
scrapy shell "http://www.sciencedirect.com/science/journal/00221694/"
I am looking for the second div:
<div id="issueListHeader" class="pubBody">...</div>
<div class="pubBody">...</div>
I can only get the first but not the second. The best answers to similar questions suggested trying something like:
hxs.xpath('//div[contains(@class,"pubBody") and not(contains(@id,"issueListHeader"))]')
but this returns an empty list for some reason. Any help please? Must be missing something silly, I've tried this for days!
Other details:
Once in the scrapy shell:
import scrapy
hxs = scrapy.Selector(response)
hxs.xpath('//div[@class="pubBody"]')
This works only for the first div element:
[<Selector xpath='//div[@class="pubBody"]' data='<div id="issueListHeader" class="pubBody'>]
To get the second div element I've also tried:
hxs.xpath('//div[@class="pubBody" and not(@id="issueListHeader")]').extract_first()
hxs.xpath('//div[starts-with(@class, "pubBody") and not(re:test(@id, "issueListHeader"))]')
I also copied the XPath directly from Chrome, but it too returns '[]':
hxs.xpath('//*[@id="issueList"]/div/form/div[2]')
Upvotes: 1
Views: 1156
Reputation: 474191
The problem is that the HTML on this page is very far from well-formed. To demonstrate, see how the exact same CSS selector produces 0 results with Scrapy but 94 with BeautifulSoup:
In [1]: from bs4 import BeautifulSoup
In [2]: soup = BeautifulSoup(response.body, 'html5lib') # note: "html5lib" has to be installed
In [3]: len(soup.select(".article h4 a"))
Out[3]: 94
In [4]: len(response.css(".article h4 a"))
Out[4]: 0
The same goes for the pubBody element you are trying to locate:
In [6]: len(response.css(".pubBody"))
Out[6]: 1
In [7]: len(soup.select(".pubBody"))
Out[7]: 2
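As a quick sanity check, you can also round-trip the cleaned soup back into a Scrapy selector right in the shell (a minimal sketch; with html5lib having repaired the markup, your original XPath should now match both divs):
In [8]: import scrapy
In [9]: fixed = scrapy.Selector(text=str(soup))
In [10]: fixed.xpath('//div[@class="pubBody" and not(@id="issueListHeader")]')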
So, try hooking up BeautifulSoup to fix/clean up the HTML, ideally through a middleware.
I've created a simple scrapy_beautifulsoup middleware to easily hook into the project:
install it via pip:
pip install scrapy-beautifulsoup
configure the middleware in settings.py
:
DOWNLOADER_MIDDLEWARES = {
'scrapy_beautifulsoup.middleware.BeautifulSoupMiddleware': 543
}
BEAUTIFULSOUP_PARSER = "html5lib"
Profit.
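Under the hood, such a middleware is essentially just a process_response hook that round-trips the body through BeautifulSoup. A minimal sketch of the idea (illustrative only; the actual package's code may differ):
from bs4 import BeautifulSoup

class BeautifulSoupMiddleware:
    def __init__(self, crawler):
        # parser backend is taken from settings, e.g. "html5lib"
        self.parser = crawler.settings.get('BEAUTIFULSOUP_PARSER', 'html.parser')

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def process_response(self, request, response, spider):
        # re-serialize the body through BeautifulSoup so Scrapy's
        # selectors see well-formed HTML
        cleaned = str(BeautifulSoup(response.body, self.parser))
        return response.replace(body=cleaned)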
Upvotes: 1
Reputation: 1273
I suspect the problem is that the source for the page you're trying to parse (http://www.sciencedirect.com/science/journal/00221694/) is not valid XML due to the <link ...> nodes/elements/tags not having closing tags. There may be other problems, but those are the first ones I found.
I'm rusty on JavaScript, but you may try navigating the DOM down to a lower level in the page (i.e. body or some other node closer to the elements you're trying to target) and then performing the XPath from that level.
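For example, something along these lines in the Scrapy shell (a sketch only; the #issueList container is taken from the XPath in the question and may need adjusting):
container = hxs.xpath('//*[@id="issueList"]')
container.xpath('.//div[contains(@class, "pubBody")]')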
UPDATE: I just tried removing the <head> of the document and passing it through an XML parser, and it still breaks on several <input> nodes that are not closed. Unless I'm forgetting some special JavaScript XML/XPath rules or methods that forgive unclosed tags, I suspect you might be better served using something like jQuery to find the elements you're looking for.
Upvotes: 0