jstats

Reputation: 35

web scraper with HTTP Error 503: Service Unavailable

I am trying to build a scraper, but I keep getting HTTP Error 503, which blocks my requests. I can still access the website manually, so my IP address hasn't been blocked. I keep switching user agents and still can't get my code to run all the way through. Sometimes it gets through up to 15 items, sometimes none, but it always fails eventually. I have no doubt that I'm doing something wrong in my code; I did shave it down to fit here, so please keep that in mind. How do I fix this without using third-party tools?

import requests
import urllib2
from urllib2 import urlopen     
import random
from contextlib import closing
from bs4 import BeautifulSoup
import ssl
import parser
import time
from time import sleep

def Parser(urls):
    randomint = random.randint(0, 2)
    randomtime = random.randint(5, 30)

    url = "https://www.website.com"   
    user_agents = [
        "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0)",
        "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0)",
        "Opera/9.80 (Windows NT 6.1; U; cs) Presto/2.2.15 Version/10.00"
    ]
    index = 0
    opener = urllib2.build_opener()
    opener.addheaders = [('User-agent', user_agents[randomint])]

def ReadUPC():
    UPCList = [
        'upc',
        'upc2',
        'upc3',
        'upc4',
        'etc.'
    ]

    extracted_data = []
    for i in UPCList:
        urls = "https://www.website.com" + i
        randomtime = random.randint(5, 30)
        Soup = BeautifulSoup(urlopen(urls), "lxml")
        price = Soup.find("span", {"class": "a-size-base a-color-price s-price a-text-bold"})
        sleep(randomtime)

        randomt = random.randint(5, 15)
        print "ref url:", urls
        sleep(randomt)
        print "Our price:",price
        sleep(randomtime)

if __name__ == "__main__":
    ReadUPC()
    sleep(10)



The traceback ends in urllib2's default error handler:

    554 class HTTPDefaultErrorHandler(BaseHandler):
    555     def http_error_default(self, req, fp, code, msg, hdrs):
    556         raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)

HTTPError: HTTP Error 503: Service Unavailable

Upvotes: 2

Views: 6041

Answers (2)

Umair Ayub

Reputation: 21271

What website are you scraping? Most websites use cookies to recognize the user as well, so make sure your code accepts and sends cookies.

Also open that link in a browser alongside Firebug and look at the headers your browser sends to the server when making the request, then try to replicate all of those headers in your code.
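As a rough sketch (assuming Python 2, like the code in the question, and only the standard library), a cookie-aware opener that sends browser-like headers could look something like this; the URL and header values are placeholders you would replace with whatever your own browser actually sends:

import cookielib
import urllib2

# Cookie jar so that cookies set by the server are stored and sent back
# automatically on later requests made through the same opener.
cookie_jar = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie_jar))

# Placeholder header values -- copy the real ones from your browser.
opener.addheaders = [
    ('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64) ...'),
    ('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'),
    ('Accept-Language', 'en-US,en;q=0.5'),
]

response = opener.open("https://www.website.com/some-page")
html = response.read()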

PS:

In my view, sending random user-agent strings from the SAME IP won't make any difference unless you are also rotating IPs.
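For completeness, here is a rough sketch of what routing a request through a proxy looks like with urllib2's ProxyHandler; the proxy addresses are placeholders, and finding working proxies is outside the scope of this answer:

import random
import urllib2

# Placeholder proxy addresses -- substitute proxies you actually control.
proxies = [
    "203.0.113.10:8080",
    "203.0.113.11:8080",
]

# Rotating IPs just means picking a different proxy for each request.
proxy = random.choice(proxies)
opener = urllib2.build_opener(urllib2.ProxyHandler({"https": proxy}))
response = opener.open("https://www.website.com/some-page")
print response.getcode()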

Upvotes: 1

Marcus Müller

Reputation: 36442

Behave like a normal human being using a browser. That website appears to be designed to analyse your behaviour; it detects that you are a scraper and wants to block you. In the easiest case, a minimal piece of JavaScript that rewrites link URLs on the fly would be enough to defeat "dumb" scrapers.

There are elegant ways out of this dilemma, for example by instrumenting a browser, but that won't happen without external tools.
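If you are willing to drop the "no third parties" requirement, a minimal sketch of instrumenting a browser with Selenium could look like the following; the URL and the CSS class are placeholders, and you would need a WebDriver-compatible browser installed:

from selenium import webdriver
from bs4 import BeautifulSoup
import time

# Drive a real browser so that JavaScript runs and the traffic looks like
# a normal browsing session.
driver = webdriver.Firefox()          # or webdriver.Chrome()
try:
    driver.get("https://www.website.com/some-page")   # placeholder URL
    time.sleep(5)                      # let any JavaScript finish
    soup = BeautifulSoup(driver.page_source, "lxml")
    price = soup.find("span", {"class": "a-size-base a-color-price"})  # placeholder class
    print price
finally:
    driver.quit()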

Upvotes: 0
