Reputation: 727
I have been learning Scrapy for the past couple of days, and I am having trouble getting all the list elements on a page.
The page has a structure similar to this:
<ol class="list-results">
<li class="SomeClass i">
<ul>
<li class="name">Name1</li>
</ul>
</li>
<li class="SomeClass 0">
<ul>
<li class="name">Name2</li>
</ul>
</li>
<li class="SomeClass i">
<ul>
<li class="name">Name3</li>
</ul>
</li>
</ol>
In the spider's parse method, I get all the list elements like this:
def parse(self, response):
    sel = Selector(response)
    all_elements = sel.css('.SomeClass')
    print(len(all_elements))
I know that there are about 300 list elements with that class on the test page I request. However, when I print len(all_elements), I get only 61.
I have tried using xpaths like:
sel.xpath("//*[contains(concat(' ', @class, ' '), 'SomeClass')]")
Yet I still get only about 61 elements instead of the roughly 300 I expect.
I am also using a try/except clause in case one of the elements raises an exception.
Here is the actual page I would be scraping: https://search.msu.edu/people/index.php?fst=ab&lst=&nid=&filter=
Please understand, I am doing this for practice only!
Please help! Thank you! I just don't know what else to try.
Upvotes: 4
Views: 74
Reputation: 473873
I am afraid you are dealing with broken, non-well-formed HTML, which Scrapy (and the underlying lxml) is not able to parse reliably. For instance, see this unclosed div inside the li tag:
<li class="unit"><span>Unit:</span>
<div class="unit-block"> Language Program
</li>
I'd switch to parsing the HTML manually with BeautifulSoup. In other words, continue to use all the other parts and components of the Scrapy framework, but leave the HTML parsing to BeautifulSoup.
Demo from the scrapy shell:
$ scrapy shell "https://search.msu.edu/people/index.php?fst=ab&lst=&nid=&filter="
In [1]: len(response.css('li.student'))
Out[1]: 55
In [2]: from bs4 import BeautifulSoup
In [3]: soup = BeautifulSoup(response.body)
In [4]: len(soup.select('li.student'))
Out[4]: 281
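The same idea could be moved out of the shell into a small helper that a spider callback calls. This is only a rough sketch: the helper name extract_names and the sample markup are made up for illustration, the selector follows the simplified structure from the question rather than the real page, and the choice of "html.parser" is an assumption (the demo above does not name a parser, and different parsers may repair broken markup differently, so counts can vary).

```python
from bs4 import BeautifulSoup


def extract_names(html):
    """Return the text of every li.name nested inside an li.SomeClass item.

    Hypothetical helper; the class names come from the simplified
    structure in the question, not from the real page.
    """
    # "html.parser" is an assumed choice; lxml or html5lib may build
    # a different tree from the same broken markup.
    soup = BeautifulSoup(html, "html.parser")
    return [li.get_text(strip=True)
            for li in soup.select("li.SomeClass li.name")]


# Made-up sample with an unclosed <div>, mimicking the kind of
# breakage shown in the answer.
sample = """
<ol class="list-results">
  <li class="SomeClass i"><ul><li class="name">Name1</li></ul>
    <div class="unit-block">unclosed div
  </li>
  <li class="SomeClass 0"><ul><li class="name">Name2</li></ul></li>
</ol>
"""

print(extract_names(sample))
```

Inside a spider, the callback would probably pass the decoded page to the helper, for example names = extract_names(response.text), and keep using Scrapy for requests, items, and pipelines.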
If you are using a CrawlSpider and need a LinkExtractor based on BeautifulSoup, see:
Upvotes: 2