Reputation: 35
I have a problem with the "start_requests" method of my Scrapy spider (Python). I am using a proxy host and port to scrape data from another site, but I get:
    [scrapy.extensions.logstats] INFO: Crawled 1 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    [scrapy.downloadermiddlewares.retry] DEBUG: Retrying http://....../> (failed 2 times): TCP connection timed out: 110: Connection timed out.
My code is:
    def get_proxy(self):
        self.conn = MySQLdb.connect(
            settings['MYSQL_HOST'],
            settings['MYSQL_USER'],
            settings['MYSQL_PASSWD'],
            settings['MYSQL_DBNAME'],
            charset="utf8", use_unicode=True)
        self.cursor = self.conn.cursor()
        try:
            results = self.cursor.execute("SELECT proxy, port FROM geme_proxies WHERE is_active = '1' AND is_deleted = '0' ORDER BY RAND() LIMIT 1")
            if results > 0:
                row = self.cursor.fetchone()
                return row
            else:
                return
        except Exception, e:
            logger.error('Exception Message: ' + str(e))

    def start_requests(self):
        proxy_data = self.get_proxy()
        urls = [settings['OBERWIL_NEWS_URL']]
        for url in urls:
            request = scrapy.Request(url=url, callback=self.parse)
            request.meta['proxy'] = 'http://' + proxy_data[0] + ':' + proxy_data[1]
            proxy_user_pass = settings['PROXY_USERNAME'] + ':' + settings['PROXY_PASSWORD']
            encoded_user_pass = base64.encodestring(proxy_user_pass)
            request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass
            yield request
Please help me to solve this issue.
Upvotes: 0
Views: 1030
Reputation: 622
I don't think this is a sound way to use proxies. (Free) proxies die often or become unresponsive without any warning, and since you use a single proxy for loading all of your URLs, any problem with that first randomly chosen proxy leaves you with this error: the proxy is set on the request itself, so each retry goes through the same dead proxy.
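If you do keep setting the proxy per request as in the question, a few details are worth guarding against regardless of proxy quality. The sketch below is only an illustration under assumptions: `build_proxy_meta` is a hypothetical helper name, not part of Scrapy. It converts the port to a string, fails loudly when no proxy row came back, and uses `base64.b64encode`, because `base64.encodestring` appends a trailing newline to the encoded value (and no longer exists in recent Python 3 releases).

```python
import base64


def build_proxy_meta(host, port, username, password):
    """Return (proxy_url, auth_header) for one authenticated HTTP proxy.

    Hypothetical helper for illustration; the names are assumptions.
    """
    if not host or not port:
        # get_proxy() in the question can return None when no row matches.
        raise ValueError("no usable proxy row was returned")
    # format() accepts the port whether the database returns int or str.
    proxy_url = "http://{}:{}".format(host, port)
    credentials = "{}:{}".format(username, password).encode("utf-8")
    # b64encode does not add the trailing newline that encodestring adds.
    token = base64.b64encode(credentials).decode("ascii")
    return proxy_url, "Basic " + token
```

In `start_requests`, the two return values would be assigned to `request.meta['proxy']` and `request.headers['Proxy-Authorization']` respectively. This tidies the request, but it does not help when the chosen proxy itself is dead.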
A better approach would be to use "rotating proxies" instead:
    pip install scrapy-rotating-proxies
This will allow you to rotate proxies transparently without having to handle the intermediate steps yourself. The approach only requires installing the package and then keeping the proxy list (file: proxylist.txt) up to date.
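Activate it in settings.py by adding the middlewares to DOWNLOADER_MIDDLEWARES, and set ROTATING_PROXY_LIST_PATH to the path of the proxy file:
    DOWNLOADER_MIDDLEWARES = {
        'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
        'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
    }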
proxylist.txt:
165.22.50.208:8080
139.180.163.43:3128
14.207.137.192:8080
The rotating-proxies package also has options for supplying the proxies from a source other than a file, such as a database, along with other settings for tuning your crawlers to the target website.
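As a rough sketch of the database route, assuming the `proxy` and `port` columns from the question's `geme_proxies` table: the rows can be turned into "host:port" strings and assigned to the package's ROTATING_PROXY_LIST setting instead of using a file. `rows_to_proxy_list` is a hypothetical helper name, and fetching the rows is left to your existing MySQL code.

```python
def rows_to_proxy_list(rows):
    """Turn (host, port) rows into 'host:port' strings.

    Hypothetical helper for illustration; rows with a missing host
    or port are skipped rather than passed on to the middleware.
    """
    proxies = []
    for host, port in rows:
        if host and port:
            proxies.append("{}:{}".format(str(host).strip(), port))
    return proxies
```

For example, after running the question's query without `ORDER BY RAND() LIMIT 1`, something like `ROTATING_PROXY_LIST = rows_to_proxy_list(cursor.fetchall())` would hand every active proxy to the middleware, which then rotates among them.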
Upvotes: 1