Reputation: 657
I am trying to scrape the results of the following page:
http://www.peekyou.com/work/autodesk/page=1
with page = 1, 2, 3, 4, ... and so on, as per the results. I have a PHP file that runs the crawler for different page numbers. The code (for a single page) is as follows:
```python
import sys
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.item import Item
from scrapy.http import Request
#from scrapy.crawler import CrawlerProcess

class DmozSpider(BaseSpider):
    name = "peekyou_crawler"
    start_urls = ["http://www.peekyou.com/work/autodesk/page=1"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        discovery = hxs.select('//div[@class="nextPage"]/table/tr[2]/td/a[contains(@title,"Next")]')
        print len(discovery)
        print "Starting the actual file"
        items = hxs.select('//div[@class="resultCell"]')
        count = 0
        for newsItem in items:
            print newsItem
            url = newsItem.select('h2/a/@href').extract()
            name = newsItem.select('h2/a/span/text()').extract()
            count = count + 1
            print count
            print url[0]
            print name[0]
            print "\n"
```
The Autodesk results have 18 pages. When I run the code to crawl all the pages, the crawler only gets data from page 2, not from all pages. Similarly, when I changed the company name to something else, it again scraped some pages and not the rest, even though I get an HTTP 200 response for every page. Also, on repeated runs it keeps scraping the same pages, but never all of them. Any idea what the error in my approach could be, or what I am missing?
Thanks in advance.
Upvotes: 1
Views: 2856
Reputation: 473873
I'll give you a starting point.
The page you're trying to crawl is loaded via AJAX, which is a problem for Scrapy: it cannot handle dynamic page content loaded via AJAX/XHR requests. For more info see:
Using your browser's developer tools, you can see an outgoing POST request fired after the page load. It goes to http://www.peekyou.com/work/autodesk/web_results/web_tag_search_checker.php.
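Before wiring that into a spider, you can replay the request by hand to confirm the endpoint behaves as described. A minimal sketch using the `requests` library (the id token is the one from the dev-tools capture and likely varies per search):

```python
# Minimal verification sketch: replay the XHR seen in dev tools
# and check that the response contains the result markup.
import requests

url = "http://www.peekyou.com/work/autodesk/web_results/web_tag_search_checker.php"
data = {'id': 'search_work_a10362ede5ed8ed5ff1191321978f12a',  # token from dev tools
        '_': ''}

response = requests.post(url, data=data)
print response.status_code   # expect 200
print response.text[:300]    # should contain the resultCell divs
```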
So, simulating this request in Scrapy should let you crawl the necessary data:
```python
from scrapy.http import FormRequest
from scrapy.item import Item, Field
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector


class DmozItem(Item):
    name = Field()
    link = Field()


class DmozSpider(BaseSpider):
    name = "peekyou_crawler"
    start_urls = [
        "http://www.peekyou.com/work/autodesk/page=%d" % i
        for i in xrange(1, 19)  # pages 1..18
    ]

    def parse(self, response):
        yield FormRequest(url="http://www.peekyou.com/work/autodesk/web_results/web_tag_search_checker.php",
                          formdata={'id': 'search_work_a10362ede5ed8ed5ff1191321978f12a',
                                    '_': ''},
                          method="POST",
                          callback=self.after_post)

    def after_post(self, response):
        hxs = HtmlXPathSelector(response)
        persons = hxs.select("//div[@class='resultCell']")
        for person in persons:
            item = DmozItem()
            item['name'] = person.select('.//h2/a/span/text()').extract()[0].strip()
            item['link'] = person.select('.//h2/a/@href').extract()[0].strip()
            yield item
```
It works, but it dumps only the first page. I'll leave it to you to work out how to get the other results.
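One direction to investigate, as an unverified sketch: it assumes each page=N response embeds its own search_work_<hash> token that the XHR endpoint expects, which you'd need to confirm in the raw page source.

```python
import re

from scrapy.http import FormRequest

# Unverified sketch -- a replacement for DmozSpider.parse above.
# Assumes each page=N response embeds its own search_work_<hash>
# token; check the raw HTML to confirm before relying on this.
def parse(self, response):
    match = re.search(r"search_work_[0-9a-f]{32}", response.body)
    if match:
        yield FormRequest(
            url="http://www.peekyou.com/work/autodesk/web_results/web_tag_search_checker.php",
            formdata={'id': match.group(0), '_': ''},
            callback=self.after_post)
```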
Hope that helps.
Upvotes: 1
Reputation: 142661
You can add more addresses:
```python
start_urls = [
    "http://www.peekyou.com/work/autodesk/page=1",
    "http://www.peekyou.com/work/autodesk/page=2",
    "http://www.peekyou.com/work/autodesk/page=3",
]
```
You can generate more addresses:
```python
start_urls = [
    "http://www.peekyou.com/work/autodesk/page=%d" % i
    for i in xrange(1, 19)  # pages 1..18
]
```
I think you should read about start_requests()
and how to generate the next URL from within the spider. But I can't help you much here, because I don't use Scrapy; I still use pure Python (and PyQuery) to create simple crawlers ;)
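For reference, a minimal start_requests() sketch using the old BaseSpider API from the code above (PagedSpider is a made-up name):

```python
from scrapy.http import Request
from scrapy.spider import BaseSpider

class PagedSpider(BaseSpider):  # hypothetical spider name
    name = "paged"

    # Yield one request per results page instead of listing start_urls.
    def start_requests(self):
        for i in xrange(1, 19):  # pages 1..18
            yield Request("http://www.peekyou.com/work/autodesk/page=%d" % i,
                          callback=self.parse)

    def parse(self, response):
        pass  # extract items here
```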
PS. Sometimes servers check your User-Agent and IP, and how fast you grab the next page, and then stop sending pages to you.
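In Scrapy, the usual knobs for that live in settings.py (a sketch with example values):

```python
# settings.py sketch -- example values, tune for the target site
USER_AGENT = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36'
DOWNLOAD_DELAY = 2       # seconds to wait between requests
CONCURRENT_REQUESTS = 1  # crawl politely; fast crawlers get blocked
```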
Upvotes: 1