Reputation: 447
I'm a newbie to XPath and Scrapy. I'm trying to target a node which does not have a unique class (i.e. class="pubBody").
Already tried: xpath not contains A and B
This should be a simple task, but XPath just misses the second item. I am doing this from the Scrapy shell. On the command prompt:
scrapy shell "http://www.sciencedirect.com/science/journal/00221694/"
I am looking for the second div:
<div id="issueListHeader" class="pubBody">...</div>
<div class="pubBody">...</div>
I can only get the first but not the second. The best answers to similar questions suggested trying something like:
hxs.xpath('//div[contains(@class,"pubBody") and not(contains(@id,"issueListHeader"))]')
but this returns an empty list for some reason. Any help please? Must be missing something silly, I've tried this for days!
Other details:
Once in the scrapy shell:
import scrapy
hxs = scrapy.Selector(response)
hxs.xpath('//div[@class="pubBody"]')
This works only for the first div element:
[<Selector xpath='//div[@class="pubBody"]' data='<div id="issueListHeader" class="pubBody'>]
To get the second div element I've also tried:
hxs.xpath('//div[@class="pubBody" and not(@id="issueListHeader")]').extract_first()
hxs.xpath('//div[starts-with(@class, "pubBody") and not(re:test(@id, "issueListHeader"))]')
I also copied the XPath directly from Chrome, but it too returns '[]':
hxs.xpath('//*[@id="issueList"]/div/form/div[2]')
Upvotes: 1
Views: 1156
Reputation: 474191
The problem is that the HTML on this page is very far from well-formed. To demonstrate, see how the exact same CSS selector produces 0 results with Scrapy but 94 with BeautifulSoup:
In [1]: from bs4 import BeautifulSoup
In [2]: soup = BeautifulSoup(response.body, 'html5lib') # note: "html5lib" has to be installed
In [3]: len(soup.select(".article h4 a"))
Out[3]: 94
In [4]: len(response.css(".article h4 a"))
Out[4]: 0
The same goes for the pubBody element you are trying to locate:
In [6]: len(response.css(".pubBody"))
Out[6]: 1
In [7]: len(soup.select(".pubBody"))
Out[7]: 2
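As a quick sanity check, you can also round-trip the cleaned soup back into a Scrapy selector right in the shell (a minimal sketch; with html5lib having repaired the markup, your original XPath should now match both divs):
In [8]: import scrapy
In [9]: fixed = scrapy.Selector(text=str(soup))
In [10]: fixed.xpath('//div[@class="pubBody" and not(@id="issueListHeader")]')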
So, try hooking up BeautifulSoup to fix/clean up the HTML, ideally through a middleware.
I've created a simple scrapy_beautifulsoup middleware to easily hook into the project:
install it via pip:
pip install scrapy-beautifulsoup
configure the middleware in settings.py
:
DOWNLOADER_MIDDLEWARES = {
'scrapy_beautifulsoup.middleware.BeautifulSoupMiddleware': 543
}
BEAUTIFULSOUP_PARSER = "html5lib"
Profit.
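Under the hood, such a middleware is essentially just a process_response hook that round-trips the body through BeautifulSoup. A minimal sketch of the idea (illustrative only; the actual package's code may differ):
from bs4 import BeautifulSoup

class BeautifulSoupMiddleware:
    def __init__(self, crawler):
        # parser backend is taken from settings, e.g. "html5lib"
        self.parser = crawler.settings.get('BEAUTIFULSOUP_PARSER', 'html.parser')

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def process_response(self, request, response, spider):
        # re-serialize the body through BeautifulSoup so Scrapy's
        # selectors see well-formed HTML
        cleaned = str(BeautifulSoup(response.body, self.parser))
        return response.replace(body=cleaned)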
Upvotes: 1
Reputation: 1273
I suspect the problem is that the source for the page you're trying to parse (http://www.sciencedirect.com/science/journal/00221694/) is not valid XML due to the <link ...> nodes/elements/tags not having closing tags. There may be other problems, but those are the first ones I found.
I'm rusty on JavaScript, but you may try navigating the DOM down to a lower level in the page (i.e. body or some other node closer to the elements you're trying to target) and then performing the XPath from that level.
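For example, something along these lines in the Scrapy shell (a sketch only; the #issueList container is taken from the XPath in the question and may need adjusting):
container = hxs.xpath('//*[@id="issueList"]')
container.xpath('.//div[contains(@class, "pubBody")]')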
UPDATE: I just tried removing the <head> of the document and passing it through an XML parser, and it still breaks on several <input> nodes that are not closed. Unless I'm forgetting some special JavaScript XML/XPath rules or methods that forgive unclosed tags, I suspect you might be better served using something like jQuery to find the elements you're looking for.
Upvotes: 0