Nazariy

Reputation: 727

Spider not scraping the right amount of items

I have been learning Scrapy for the past couple of days, and I am having trouble with getting all the list elements on the page.

So the page has a similar structure like this:

<ol class="list-results">
    <li class="SomeClass i">
        <ul>
            <li class="name">Name1</li>
        </ul>
    </li>
    <li class="SomeClass 0">
        <ul>
            <li class="name">Name2</li>
        </ul>
    </li>
    <li class="SomeClass i">
        <ul>
            <li class="name">Name3</li>
        </ul>
    </li>
</ol>

In the Parse function of Scrapy, I get all the list elements like this:

def parse(self, response):
    sel = Selector(response)
    all_elements = sel.css('.SomeClass')
    print(len(all_elements))

I know that there are about 300 list elements with that class on the test page I request, but len(all_elements) prints only 61.

I have tried using xpaths like:

sel.xpath("//*[contains(concat(' ', @class, ' '), 'SomeClass')]")

And yet I still get 61 elements instead of the roughly 300 I should.

I am also wrapping the parsing in a try/except clause in case a single element raises an exception.

Here is the actual page I would be scraping: https://search.msu.edu/people/index.php?fst=ab&lst=&nid=&filter=

Please understand, I am doing this for practice only!

Please help! Thank you! I just don't know what else to do!

Upvotes: 4

Views: 74

Answers (1)

alecxe

Reputation: 473873

I'm afraid you are dealing with non-well-formed, broken HTML that Scrapy (and the underlying lxml) cannot parse reliably. For instance, see this unclosed div inside a li tag:

<li class="unit"><span>Unit:</span> 
    <div class="unit-block"> Language Program                  
</li>

I'd switch to parsing the HTML with BeautifulSoup. In other words, keep using all the other parts and components of the Scrapy framework, but leave the HTML parsing to BeautifulSoup.

Demo from the scrapy shell:

$ scrapy shell "https://search.msu.edu/people/index.php?fst=ab&lst=&nid=&filter="
In [1]: len(response.css('li.student'))
Out[1]: 55

In [2]: from bs4 import BeautifulSoup

In [3]: soup = BeautifulSoup(response.body)

In [4]: len(soup.select('li.student'))
Out[4]: 281
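Inside the spider, this could look something like the sketch below. The helper takes the raw HTML and does the extraction with BeautifulSoup; note that the `li.name` selector, the helper name, and the yielded item shape are assumptions for illustration, not code from your spider:

```python
from bs4 import BeautifulSoup


def extract_names(html):
    """Parse the (possibly broken) HTML leniently with BeautifulSoup
    and return the text of every li.name element."""
    soup = BeautifulSoup(html, "html.parser")
    return [li.get_text(strip=True) for li in soup.select("li.name")]


# In the Scrapy callback you would just delegate to the helper:
#
#     def parse(self, response):
#         for name in extract_names(response.text):
#             yield {"name": name}

html = """
<ol class="list-results">
    <li class="SomeClass i"><ul><li class="name">Name1</li></ul></li>
    <li class="SomeClass 0"><ul><li class="name">Name2</li></ul></li>
</ol>
"""
print(extract_names(html))  # ['Name1', 'Name2']
```

The rest of the crawl (requests, pipelines, item export) stays pure Scrapy; only the selection step changes.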

If you are using a CrawlSpider and need a LinkExtractor based on BeautifulSoup, see:
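The core of such an extractor could be sketched like this; the function name is hypothetical, and it only covers the href-collection part (a real LinkExtractor would also deduplicate and return scrapy.link.Link objects):

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_links(html, base_url):
    """Collect absolute URLs from every <a href=...> in the page,
    using BeautifulSoup's lenient parsing instead of lxml."""
    soup = BeautifulSoup(html, "html.parser")
    return [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]


print(extract_links('<a href="/people/foo">Foo</a>', "https://search.msu.edu"))
# ['https://search.msu.edu/people/foo']
```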

Upvotes: 2
