Nazariy

Reputation: 727

Spider not scraping the right amount of items

I have been learning Scrapy for the past couple of days, and I am having trouble with getting all the list elements on the page.

So the page has a similar structure like this:

<ol class="list-results">
    <li class="SomeClass i">
        <ul>
            <li class="name">Name1</li>
        </ul>
    </li>
    <li class="SomeClass 0">
        <ul>
            <li class="name">Name2</li>
        </ul>
    </li>
    <li class="SomeClass i">
        <ul>
            <li class="name">Name3</li>
        </ul>
    </li>
</ol>

In the Parse function of Scrapy, I get all the list elements like this:

def parse(self, response):
    sel = Selector(response)
    all_elements = sel.css('.SomeClass')
    print(len(all_elements))

I know that there are about 300 list elements with that class on the test page I request, but len(all_elements) prints only 61.

I have tried using xpaths like:

sel.xpath("//*[contains(concat(' ', @class, ' '), 'SomeClass')]")

And yet I still get 61 elements instead of the roughly 300 I should.

I am also wrapping the parsing in a try/except clause in case a single element raises an exception.

Here is the actual page I would be scraping: https://search.msu.edu/people/index.php?fst=ab&lst=&nid=&filter=

Please understand, I am doing this for practice only!

Please help! Thank you! I just don't know what else to do!

Upvotes: 4

Views: 74

Answers (1)

alecxe

Reputation: 473873

I'm afraid you are dealing with non-well-formed, broken HTML that Scrapy (and the underlying lxml) cannot parse reliably. For instance, see this unclosed div inside a li tag:

<li class="unit"><span>Unit:</span> 
    <div class="unit-block"> Language Program                  
</li>

I'd switch to parsing the HTML with BeautifulSoup. In other words, keep using all the other parts and components of the Scrapy framework, but leave the HTML parsing to BeautifulSoup.

Demo from the scrapy shell:

$ scrapy shell "https://search.msu.edu/people/index.php?fst=ab&lst=&nid=&filter="
In [1]: len(response.css('li.student'))
Out[1]: 55

In [2]: from bs4 import BeautifulSoup

In [3]: soup = BeautifulSoup(response.body)

In [4]: len(soup.select('li.student'))
Out[4]: 281
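Inside the spider, this could look something like the sketch below. The helper takes the raw HTML and does the extraction with BeautifulSoup; note that the `li.name` selector, the helper name, and the yielded item shape are assumptions for illustration, not code from your spider:

```python
from bs4 import BeautifulSoup


def extract_names(html):
    """Parse the (possibly broken) HTML leniently with BeautifulSoup
    and return the text of every li.name element."""
    soup = BeautifulSoup(html, "html.parser")
    return [li.get_text(strip=True) for li in soup.select("li.name")]


# In the Scrapy callback you would just delegate to the helper:
#
#     def parse(self, response):
#         for name in extract_names(response.text):
#             yield {"name": name}

html = """
<ol class="list-results">
    <li class="SomeClass i"><ul><li class="name">Name1</li></ul></li>
    <li class="SomeClass 0"><ul><li class="name">Name2</li></ul></li>
</ol>
"""
print(extract_names(html))  # ['Name1', 'Name2']
```

The rest of the crawl (requests, pipelines, item export) stays pure Scrapy; only the selection step changes.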

If you are using a CrawlSpider and need a LinkExtractor based on BeautifulSoup, see:
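The core of such an extractor could be sketched like this; the function name is hypothetical, and it only covers the href-collection part (a real LinkExtractor would also deduplicate and return scrapy.link.Link objects):

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_links(html, base_url):
    """Collect absolute URLs from every <a href=...> in the page,
    using BeautifulSoup's lenient parsing instead of lxml."""
    soup = BeautifulSoup(html, "html.parser")
    return [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]


print(extract_links('<a href="/people/foo">Foo</a>', "https://search.msu.edu"))
# ['https://search.msu.edu/people/foo']
```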

Upvotes: 2
