FoldFence

Reputation: 2802

python - Service Unavailable - urllib proxy not working

I am scraping information from Google. I know that I will be blocked after a few requests, which is why I tried to route them through proxies. For the proxies I use ProxyBroker from this link: The Link

However, when I use proxies, Google returns a 503. If I open the error page, Google shows me my own IP and not the proxy IP.

Here is what I've tried:

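import contextlib
import urllib.request as urlrequest  # assumed alias; the snippet runs inside a class method

usedProxy = self.getProxy()
if usedProxy is not None:
    # Only the "http" scheme is mapped, so only http:// URLs use the proxy.
    proxies = {"http": "http://%s" % usedProxy[0]}
    headers = {'User-agent': 'Mozilla/5.0'}
    proxy_support = urlrequest.ProxyHandler(proxies)
    # debuglevel=1 prints the raw request and response lines.
    opener = urlrequest.build_opener(proxy_support, urlrequest.HTTPHandler(debuglevel=1))
    urlrequest.install_opener(opener)

    req = urlrequest.Request(search_url, None, headers)
    with contextlib.closing(urlrequest.urlopen(req)) as url:
        htmltext = url.read()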

I tried with http and https.
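
For reference, urllib picks a proxy by the scheme of the requested URL, so a mapping with only an "http" key does not cover https:// URLs. A minimal sketch of an opener that covers both schemes, assuming the proxy accepts CONNECT tunnelling (the helper name and the address are placeholders, not part of my program):

```python
import urllib.request as urlrequest


def build_proxy_opener(proxy_address):
    """Return an opener that sends http:// and https:// URLs through one proxy."""
    proxy_url = "http://%s" % proxy_address
    # One entry per URL scheme; a scheme without an entry is fetched directly.
    proxies = {"http": proxy_url, "https": proxy_url}
    return urlrequest.build_opener(urlrequest.ProxyHandler(proxies))


# Placeholder address from the documentation range, not a real proxy.
opener = build_proxy_opener("203.0.113.10:8080")
```

Calling opener.open(url) instead of installing the opener globally keeps the proxy choice local to one request.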

Even when the request itself goes through, I get a 503 with the following output:

send: b'GET http://www.google.co.in/search?q=Test/ HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: www.google.co.in\r\nUser-Agent: Mozilla/5.0\r\nConnection: close\r\n\r\n'
reply: 'HTTP/1.1 302 Found\r\n'
header: Date header: Server header: Location header: Pragma header: Expires header: Cache-Control header: Content-Type header: Content-Length header: X-XSS-Protection header: X-Frame-Options header: Connection
send: b'GET http://ipv4.google.com/sorry/index?continue=http://www.google.co.in/search%3Fq%3DTest/&q=EgTCDs9XGMbOgNAFIhkA8aeDS0dE8uXKu31DEbfj5mCVdhpUO598MgFy HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: ipv4.google.com\r\nUser-Agent: Mozilla/5.0\r\nConnection: close\r\n\r\n'
reply: 'HTTP/1.1 503 Service Unavailable\r\n'

If the above error doesn't happen, I eventually get the following error instead:

>[Errno 54] Connection reset by peer

My questions are:

  1. Is the IP shown on the error page always my own IP and not the proxy IP?

    Google Error Link

  2. If Google's error message always shows the host IP and the problem lies with the proxies, how can I get around the error?

Upvotes: 2

Views: 634

Answers (1)

FoldFence

It seems that Google knows I am going through a proxy: Google uses HTTPS, and the HTTPS proxies don't seem to work, so the HTTP proxies are detected. That's why I get blocked right after 50-60 queries.

My Solution:

I tried all the solutions I found on Stack Overflow, such as sleeping for 10 seconds, but they didn't work well. Then I found an article describing the same problem, and the solution turned out to be "quite" easy. First I installed the fake-useragent library for Python, which provides a ton of useful user agents.

I randomly select a user agent from this list for each request. I also had to restrict the choice to common user agents, because otherwise the page comes back with different HTML that my read method can't handle.

After installing fake-useragent and selecting a user agent at random, I added a sleep of between 15 and 90 seconds, because the article's author tried different time spans and got blocked with 30 seconds. With these two simple changes my program has been running successfully for 10 hours without trouble.
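
The two changes can be sketched roughly as follows. This is an illustration, not the original program: the short hard-coded list stands in for the user agents that fake-useragent would supply, and the function and constant names are made up for the example.

```python
import random
import time
import urllib.request as urlrequest

# Hypothetical stand-in for the common user agents taken from fake-useragent.
COMMON_USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:120.0) Gecko/20100101 Firefox/120.0",
]

# Sleep bounds from the answer: between 15 and 90 seconds.
MIN_DELAY_SECONDS = 15
MAX_DELAY_SECONDS = 90


def build_request(search_url):
    """Build a request that carries a randomly chosen common user agent."""
    headers = {"User-Agent": random.choice(COMMON_USER_AGENTS)}
    return urlrequest.Request(search_url, None, headers)


def next_delay():
    """Pick a random pause length, in seconds, within the configured bounds."""
    return random.uniform(MIN_DELAY_SECONDS, MAX_DELAY_SECONDS)


def fetch(search_url):
    """Wait a random interval, then fetch the page with a fresh user agent."""
    time.sleep(next_delay())
    with urlrequest.urlopen(build_request(search_url)) as response:
        return response.read()
```

Calling fetch(search_url) in a loop gives every request a different pause and, most of the time, a different user agent.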

I hope this helps you too, because it cost me a lot of time to figure out when Google blocks you. It seems to detect you every time, but it lets you through with this configuration.

Have fun, and I wish you all successful crawling!

EDIT:

The program gets through ~1000 requests before it gets banned.

Upvotes: 1
