Reputation: 99
I have created a basic spider to scrape a small group of job listings from totaljobs.com. The spider has a single start URL, which brings up the list of jobs I am interested in. From there, I issue a separate request for each page of the results, and within each of those I issue a further request for each individual job URL, with a different callback to parse the job page, roughly as sketched below.
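The overall structure is something like this (the selectors and class names here are placeholders rather than my real code):

import scrapy


class JobsSpider(scrapy.Spider):
    name = "jobs"
    start_urls = ['https://www.totaljobs.com/jobs/permanent/welder/in-uk']

    def parse(self, response):
        # follow each page of the results
        for href in response.css('a.pagination::attr(href)').extract():
            yield scrapy.Request(response.urljoin(href), callback=self.parse_results)

    def parse_results(self, response):
        # follow each individual job listing with a different callback
        for href in response.css('a.job-title::attr(href)').extract():
            yield scrapy.Request(response.urljoin(href), callback=self.parse_job)

    def parse_job(self, response):
        # extract fields from the individual job page
        yield {'title': response.css('h1::text').extract_first()}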
What I'm finding is that the start URL and all of the results page requests are handled fine - scrapy connects to the site and returns the page content. However, when it attempts to follow the URLs for each individual job page, scrapy isn't able to form a connection. Within my log file, it states:
[<twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion: Connection lost.>]
I'm afraid I don't have a huge amount of programming experience or knowledge of internet protocols, so please forgive me for not being able to provide more information on what might be going on here. I have tried changing the TLS connection type; updating to the latest versions of scrapy, twisted and OpenSSL; rolling back to previous versions of scrapy, twisted and OpenSSL; rolling back the cryptography version; creating a custom context factory; and trying various user agents and proxies. I get the same outcome every time: whenever the URL is for a specific job page, scrapy cannot connect and I get the above log output.
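Changing the TLS connection type, for example, amounted to adding something like this to settings.py (I tried the other allowed values as well):

# settings.py
DOWNLOADER_CLIENT_TLS_METHOD = 'TLSv1.2'  # also tried 'TLS', 'TLSv1.0', 'TLSv1.1'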
It may well be that I am overlooking something very obvious to seasoned scrapers that is preventing me from connecting with scrapy. I have tried following some of the advice in these threads:
https://github.com/scrapy/scrapy/issues/1429
https://github.com/requests/requests/issues/4458
https://github.com/scrapy/scrapy/issues/2717
However, some of it is a bit over my head, e.g. how to update cipher lists. I presume it is some kind of certificate/TLS issue, but then again scrapy is able to connect to other URLs on the same domain, so I don't know.
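For what it's worth, the custom context factory / cipher-list approach from those threads looks roughly like this, as far as I understand it (the module path and cipher string below are just examples, and I am not certain I have the details right):

# settings.py
DOWNLOADER_CLIENTCONTEXTFACTORY = 'myproject.context.CustomContextFactory'

# myproject/context.py
from scrapy.core.downloader.contextfactory import ScrapyClientContextFactory


class CustomContextFactory(ScrapyClientContextFactory):
    def getContext(self, hostname=None, port=None):
        ctx = super(CustomContextFactory, self).getContext(hostname, port)
        # offer a broader cipher list so the server can pick one it accepts
        ctx.set_cipher_list(b'DEFAULT:!DH')
        return ctx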
The code that I've been using to test this is very basic, but here it is anyway:
import scrapy
class Test(scrapy.Spider):
    start_urls = [
        'https://www.totaljobs.com/job/welder/jark-wakefield-job79229824',
        'https://www.totaljobs.com/job/welder/elliott-wragg-ltd-job78969310',
        'https://www.totaljobs.com/job/welder/exo-technical-job79019672',
        'https://www.totaljobs.com/job/welder/exo-technical-job79074694',
    ]
    name = "test"

    def parse(self, response):
        print 'aaaa'
        yield {'a': 1}
Scrapy fails to connect to the URLs in the code above.
It connects to the URLs in the code below without any problem.
import scrapy
class Test(scrapy.Spider):
    start_urls = [
        'https://www.totaljobs.com/jobs/permanent/welder/in-uk',
        'https://www.totaljobs.com/jobs/permanent/mig-welder/in-uk',
        'https://www.totaljobs.com/jobs/permanent/tig-welder/in-uk',
    ]
    name = "test"

    def parse(self, response):
        print 'aaaa'
        yield {'a': 1}
It'd be great if someone could try to replicate this behaviour (or not, as the case may be) and let me know. Please let me know if I should submit additional details. I apologise if I have overlooked something really obvious. I am using:
Windows 7 64 bit
Python 2.7
scrapy version 1.5.0
twisted version 17.9.0
pyOpenSSL version 17.5.0
lxml version 4.1.1
Upvotes: 1
Views: 4428
Reputation: 19156
In my case, it was caused by the user agent being rejected. You should change the user agent for each request. For that, use scrapy-fake-useragent, and then use this middleware to make sure the user agent is also changed on each retry.
from fake_useragent import UserAgent
from scrapy.downloadermiddlewares.retry import RetryMiddleware, response_status_message


class Retry500Middleware(RetryMiddleware):

    def __init__(self, settings):
        super(Retry500Middleware, self).__init__(settings)
        fallback = settings.get('FAKEUSERAGENT_FALLBACK', None)
        self.ua = UserAgent(fallback=fallback)
        self.ua_type = settings.get('RANDOM_UA_TYPE', 'random')

    def get_ua(self):
        """Gets a random UA based on the type setting (random, firefox...)."""
        return getattr(self.ua, self.ua_type)

    def process_response(self, request, response, spider):
        if request.meta.get('dont_retry', False):
            return response
        if response.status in self.retry_http_codes:
            reason = response_status_message(response.status)
            # rotate the user agent before retrying
            request.headers['User-Agent'] = self.get_ua()
            return self._retry(request, reason, spider) or response
        return response

    def process_exception(self, request, exception, spider):
        if isinstance(exception, self.EXCEPTIONS_TO_RETRY) \
                and not request.meta.get('dont_retry', False):
            # rotate the user agent before retrying after a connection error
            request.headers['User-Agent'] = self.get_ua()
            return self._retry(request, exception, spider)
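To enable this, you also need something along these lines in settings.py, so the default user agent and retry middlewares are replaced (the 'myproject.middlewares' path is a placeholder for wherever you put the class, and the scrapy-fake-useragent middleware path may differ between versions):

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_fake_useragent.middleware.RandomUserAgentMiddleware': 400,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
    'myproject.middlewares.Retry500Middleware': 550,
}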
Upvotes: 0
Reputation: 223
This is a link to a blog I recently read about responsible web scraping with Scrapy. Hopefully it's helpful.
Upvotes: 0
Reputation: 41
You can probably try setting a user agent and see if that changes things.
You might also try making requests with longer delays between them, or routing them through a proxy (see the settings sketch below).
As it is a jobs website, I imagine they have some sort of anti-scraping mechanism.
This is not a complete answer, but hopefully it gives you some insight to help you figure out your next steps.
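For concreteness, the settings-based version of these suggestions would look something like this (the values are placeholders to illustrate, not tuned recommendations):

# settings.py
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) ...'  # a real browser UA string
DOWNLOAD_DELAY = 5           # seconds between requests to the same site
AUTOTHROTTLE_ENABLED = True  # let Scrapy back off automatically
# for a proxy, set request.meta['proxy'] on each request or use a proxy middleware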
Upvotes: 1