gunesevitan

Reputation: 965

Scrapy - Handling a page which loads with 404 status code

This is the website I'm scraping. The ad pages load without any problem in a browser, but they always come back with a 404 status code, so Scrapy doesn't yield items from those links.

If I send a request to an ad from the Scrapy shell, it retries up to 10 times and a valid response is eventually returned. However, when I run the script with the scrapy crawl myspider command, the ads don't return valid responses; Scrapy sends the request only a single time.

These are the log lines for some of the failing items.

2019-07-30 15:33:51 [scrapy] DEBUG: Retrying <GET https://www.classifiedads.com/homes_for_sale/57c10snzt1wzz> (failed 1 times): 404 Not Found
2019-07-30 15:33:51 [scrapy] DEBUG: Retrying <GET https://www.classifiedads.com/homes_for_sale/49zbgqvx21wzz> (failed 1 times): 404 Not Found
2019-07-30 15:33:51 [scrapy] DEBUG: Retrying <GET https://www.classifiedads.com/homes_for_sale/49482b3hq1wzz> (failed 1 times): 404 Not Found

This is my spider's code. How can I handle this problem?

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class MySpider(CrawlSpider):

    name = 'myspider'

    start_urls = [
        'https://www.classifiedads.com/search.php?keywords=&cid=468&lid=rx10&lname=India&from=s&page=1',
        'https://www.classifiedads.com/search.php?keywords=&cid=18&lid=rx10&lname=India&page=1'
    ]

    rules = (
        Rule(LinkExtractor(allow=(r'https://www.classifiedads.com/search.php\?keywords=&cid=468&lid=rx10&lname=India&from=s&page=\d+',)), callback='parse_page', follow=True),
        Rule(LinkExtractor(allow=(r'https://www.classifiedads.com/search.php\?keywords=&cid=18&lid=rx10&lname=India&page=\d+',)), callback='parse_page', follow=True)
    )

    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36',
        'upgrade-insecure-requests': 1,
    }

    def parse_page(self, response):
        items = response.css('div#results div.resultitem div a::attr(href)').getall()

        if items:
            for item in items:
                if item.startswith('//www.classifiedads.com/'):
                    yield scrapy.Request(
                        url='https:{}'.format(item),
                        method='GET',
                        headers=self.headers,
                        callback=self.parse_items
                    )

    def parse_items(self, response):
        # scraping the items
        pass

Upvotes: 0

Views: 1908

Answers (3)

Gallaecio

Reputation: 3847

If the server is returning valid content with a 404 status code, pass 'handle_httpstatus_list': [404] in the meta parameter of your requests so that 404 responses are passed to your callback instead of being dropped.

Upvotes: 2

harry

Reputation: 201

The server is returning a 404 response. You can verify it in your terminal:

>>> import requests
>>> requests.get('https://www.classifiedads.com/commercial_for_rent/9144lxkm81wxd')
<Response [404]>

You could try it with Selenium instead.

Upvotes: 0

amarynets

Reputation: 1815

First of all, I would recommend checking Scrapy's retry settings and adding the 404 status code to RETRY_HTTP_CODES. Another option is to create an errback function and attach it to your Request. But none of these solutions is ideal. Have you tried adding some headers or cookies?
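A sketch of the retry-settings suggestion as it might look in settings.py; the listed codes are Scrapy's default RETRY_HTTP_CODES with 404 appended, and the retry count of 10 is an arbitrary choice matching the shell behaviour described in the question:

```python
# settings.py sketch: treat 404 as a retryable status so RetryMiddleware
# re-sends the request instead of giving up after the first attempt.
RETRY_ENABLED = True
RETRY_TIMES = 10  # maximum number of retries per request (assumption)
RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408, 429, 404]
```

Note that retrying a genuine 404 wastes requests, which is part of why this workaround isn't ideal.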

Upvotes: 0
