Reputation: 965
This is the website I'm scraping. The ads on the pages load without any problem, but they always come back with a 404 status code, so Scrapy doesn't yield items from those links.
If I send a request to an ad from the Scrapy shell, it retries 10 times and a valid response is returned. However, when I run the spider with the scrapy crawl myspider command, the ads don't return valid responses; Scrapy only tries the request a single time.
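For reference, the shell check I mean is roughly this (using one of the ad URLs from the log below):

scrapy shell
>>> fetch('https://www.classifiedads.com/homes_for_sale/57c10snzt1wzz')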
This is the error output for some random items.
2019-07-30 15:33:51 [scrapy] DEBUG: Retrying <GET https://www.classifiedads.com/homes_for_sale/57c10snzt1wzz> (failed 1 times): 404 Not Found
2019-07-30 15:33:51 [scrapy] DEBUG: Retrying <GET https://www.classifiedads.com/homes_for_sale/49zbgqvx21wzz> (failed 1 times): 404 Not Found
2019-07-30 15:33:51 [scrapy] DEBUG: Retrying <GET https://www.classifiedads.com/homes_for_sale/49482b3hq1wzz> (failed 1 times): 404 Not Found
This is my spider's code. How can I handle this problem?
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class MySpider(CrawlSpider):
    name = 'myspider'
    start_urls = [
        'https://www.classifiedads.com/search.php?keywords=&cid=468&lid=rx10&lname=India&from=s&page=1',
        'https://www.classifiedads.com/search.php?keywords=&cid=18&lid=rx10&lname=India&page=1'
    ]
    rules = (
        Rule(LinkExtractor(allow=(r'https://www.classifiedads.com/search.php\?keywords=&cid=468&lid=rx10&lname=India&from=s&page=\d+',)), callback='parse_page', follow=True),
        Rule(LinkExtractor(allow=(r'https://www.classifiedads.com/search.php\?keywords=&cid=18&lid=rx10&lname=India&page=\d+',)), callback='parse_page', follow=True)
    )
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36',
        'upgrade-insecure-requests': '1',
    }

    def parse_page(self, response):
        items = response.css('div#results div.resultitem div a::attr(href)').getall()
        if items:
            for item in items:
                if item.startswith('//www.classifiedads.com/'):
                    yield scrapy.Request(
                        url='https:{}'.format(item),
                        method='GET',
                        headers=self.headers,
                        callback=self.parse_items
                    )

    def parse_items(self, response):
        # scraping the items
        pass
Upvotes: 0
Views: 1908
Reputation: 3847
If the ad pages return valid content with a 404 status code, pass 'handle_httpstatus_list': [404]
in the meta
parameter of your requests so that 404 responses are handed to your callback instead of being filtered out.
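For example, the ad request in parse_page could be built roughly like this (a sketch based on the question's code):

yield scrapy.Request(
    url='https:{}'.format(item),
    headers=self.headers,
    meta={'handle_httpstatus_list': [404]},  # let parse_items receive 404 responses
    callback=self.parse_items
)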
Upvotes: 2
Reputation: 201
The server is returning a 404 response.
You can also check it in your terminal:
>>> import requests
>>> requests.get('https://www.classifiedads.com/commercial_for_rent/9144lxkm81wxd')
<Response [404]>
You can try it with Selenium instead.
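A minimal sketch of fetching one ad with Selenium (assuming a Chrome driver is available locally; the URL is the same example ad as above):

from selenium import webdriver

driver = webdriver.Chrome()  # assumes chromedriver is on PATH
driver.get('https://www.classifiedads.com/commercial_for_rent/9144lxkm81wxd')
html = driver.page_source  # rendered HTML, regardless of the HTTP status code
driver.quit()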
Upvotes: 0
Reputation: 1815
First of all, I would recommend checking these retry settings and adding the 404 status code to RETRY_HTTP_CODES
. Another option is to create an errback
function and attach it to your Request. But none of these solutions is great. Did you try adding some headers or cookies?
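A rough sketch of both ideas (the handle_error name is illustrative, not from the question). In settings.py:

RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408, 404]  # 404 added alongside the common defaults

And in the spider, attach an errback to the ad request and define the handler:

yield scrapy.Request(
    url='https:{}'.format(item),
    headers=self.headers,
    callback=self.parse_items,
    errback=self.handle_error
)

def handle_error(self, failure):
    # called on download errors and on responses dropped by HttpErrorMiddleware (e.g. 404)
    self.logger.error('Request failed: %s', failure.request.url)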
Upvotes: 0