user21306

Reputation: 71

Why is my second request not getting called in the parse method of my Scrapy spider?

I'm trying to issue two requests, but only the first one gets made. More specifically, only the first callback, parse(), seems to run. I set parse2 as the callback for the second request, but according to my output it is never invoked.

import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ["https://www.reddit.com"]
    start_urls = (
        'https://www.reddit.com/',
    )

    def start_requests(self):
        return [scrapy.Request(
            self.start_urls[0],
            method='GET'
        )]

    def parse(self, response):
        return [scrapy.Request(self.start_urls[0], callback=self.parse2)]

    def parse2(self, response):
        print(response.body[:40])

output:

2016-01-15 01:19:19 [scrapy] INFO: Scrapy 1.0.4 started (bot: example)
2016-01-15 01:19:19 [scrapy] INFO: Optional features available: ssl, http11
2016-01-15 01:19:19 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'example.spiders', 'SPIDER_MODULES': ['example.spiders'], 'BOT_NAME': 'example'}
2016-01-15 01:19:19 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2016-01-15 01:19:19 [scrapy] INFO: Enabled downloader middlewares: DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, HttpAuthMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-01-15 01:19:19 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-01-15 01:19:19 [scrapy] INFO: Enabled item pipelines:
2016-01-15 01:19:19 [scrapy] INFO: Spider opened
2016-01-15 01:19:19 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-01-15 01:19:19 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-01-15 01:19:19 [scrapy] DEBUG: Crawled (200) <GET https://www.reddit.com/> (referer: None)
2016-01-15 01:19:19 [scrapy] DEBUG: Filtered offsite request to 'www.reddit.com': <GET https://www.reddit.com/>
2016-01-15 01:19:19 [scrapy] INFO: Closing spider (finished)
2016-01-15 01:19:19 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 212,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 22307,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2016, 1, 15, 6, 19, 19, 898275),
 'log_count/DEBUG': 3,
 'log_count/INFO': 7,
 'offsite/domains': 1,
 'offsite/filtered': 1,
 'request_depth_max': 1,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2016, 1, 15, 6, 19, 19, 692480)}

Upvotes: 2

Views: 973

Answers (1)

alecxe

Reputation: 473863

That's because the request is filtered out as a duplicate. To change this behavior, pass dont_filter=True when you issue the request:

def parse(self, response):
    return scrapy.Request(self.start_urls[0], 
                          callback=self.parse2, 
                          dont_filter=True)
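
For completeness, here is a minimal sketch of the whole spider with that change applied (the structure is copied from the question; start_requests is omitted because the default implementation already issues a GET for each URL in start_urls). As a side note, dont_filter=True also bypasses the OffsiteMiddleware, which is where the "Filtered offsite request" line in your log comes from: allowed_domains normally lists bare domains such as "reddit.com" rather than full URLs.

import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"
    # Kept as in the question; the full URL here is what makes the
    # OffsiteMiddleware treat the second request as offsite.
    allowed_domains = ["https://www.reddit.com"]
    start_urls = (
        'https://www.reddit.com/',
    )

    def parse(self, response):
        # dont_filter=True skips both the duplicate filter and the offsite
        # check, so parse2 is reached even though the URL was already crawled.
        return scrapy.Request(self.start_urls[0],
                              callback=self.parse2,
                              dont_filter=True)

    def parse2(self, response):
        print(response.body[:40])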

Upvotes: 3
