Reputation: 1938
I'm trying to scrape a website with Scrapy. The website is extremely slow at times and can take 15-20 seconds to respond to the first request in a browser. Sometimes, when I crawl it with Scrapy, I keep getting a TCP timeout error, even though the site opens just fine in my browser. Here's the message:
2017-09-05 17:34:41 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET http://www.hosane.com/result/specialList> (failed 16 times): TCP connection timed out: 10060: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond..
I have even overridden the USER_AGENT setting for testing.
I don't think the DOWNLOAD_TIMEOUT setting applies in this case, since it defaults to 180 seconds, and Scrapy gives up with a TCP timeout error in under 20-30 seconds.
Any idea what is causing this issue? Is there a way to set the TCP timeout in Scrapy?
Upvotes: 4
Views: 4656
Reputation: 20748
A TCP connection timed out can happen before the Scrapy-specified DOWNLOAD_TIMEOUT because the actual initial TCP connect timeout is defined by the OS, usually in terms of TCP SYN packet retransmissions.
By default on my Linux box, I have 6 retransmissions:
cat /proc/sys/net/ipv4/tcp_syn_retries
6
which, in practice, for Scrapy too, means 0 + 1 + 2 + 4 + 8 + 16 + 32 (+64) = 127 seconds before receiving a twisted.internet.error.TCPTimedOutError: TCP connection timed out: 110: Connection timed out. from Twisted. (That's the initial attempt, then exponential backoff between retries, and no reply after the 6th retry.)
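The arithmetic above can be sketched in a few lines of Python. This is only an illustration of the sum, assuming a 1-second initial retransmission timeout that doubles after every unanswered SYN; the function name `syn_connect_timeout` and the `initial_rto` parameter are made up for this example, and real kernels may deviate from pure doubling.

```python
def syn_connect_timeout(syn_retries, initial_rto=1.0):
    """Approximate seconds before an unanswered connect() is abandoned.

    Assumption: the wait starts at `initial_rto` seconds and doubles after
    each unanswered SYN (the initial SYN plus `syn_retries` retransmissions).
    """
    # Waits are initial_rto * (1, 2, 4, ..., 2**syn_retries).
    return sum(initial_rto * 2 ** i for i in range(syn_retries + 1))


print(syn_connect_timeout(6))  # 127.0, below the 180 s DOWNLOAD_TIMEOUT
print(syn_connect_timeout(8))  # 511.0, above the 180 s DOWNLOAD_TIMEOUT
```

With 6 retries the OS gives up first, so Twisted reports the TCP timeout; with 8 retries Scrapy's own DOWNLOAD_TIMEOUT is reached first.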
If I set /proc/sys/net/ipv4/tcp_syn_retries to 8, for example, I can verify that I receive this instead:
User timeout caused connection failure: Getting http://www.hosane.com/result/specialList took longer than 180.0 seconds.
That's because 0+1+2+4+8+16+32+64+128(+256) > 180.
10060: A connection attempt failed... seems to be a Windows socket error code. If you want to change the TCP connection timeout to something at least as long as the DOWNLOAD_TIMEOUT, you'll need to change the TCP SYN retry count. (I don't know how to do it on your system, but Google is your friend.)
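For the Linux case, here is a rough sketch of how one might read the current retry count and estimate the smallest value that lets DOWNLOAD_TIMEOUT be reached first. The helper names `read_syn_retries` and `min_syn_retries` are hypothetical, the /proc path exists only on Linux, and the 1-second doubling backoff is the same assumption used in the sums above.

```python
from pathlib import Path

# Linux-only location of the SYN retry count (not present on Windows).
SYN_RETRIES_PATH = "/proc/sys/net/ipv4/tcp_syn_retries"


def read_syn_retries(path=SYN_RETRIES_PATH):
    """Return the configured SYN retry count, or None if it cannot be read."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def min_syn_retries(download_timeout=180.0, initial_rto=1.0):
    """Smallest retry count whose total backoff is at least download_timeout.

    Assumption: waits of initial_rto * (1, 2, 4, ...) seconds, so the total
    after n retries is initial_rto * (2 ** (n + 1) - 1).
    """
    retries = 0
    while initial_rto * (2 ** (retries + 1) - 1) < download_timeout:
        retries += 1
    return retries


current = read_syn_retries()
needed = min_syn_retries(180.0)
print("current:", current, "needed for a 180 s timeout:", needed)
```

Under these assumptions, 7 retries (255 seconds) is the first value that exceeds the default 180-second DOWNLOAD_TIMEOUT; on a machine without that /proc file, `read_syn_retries` simply returns None.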
Upvotes: 10