web scraper with HTTP Error 503: Service Unavailable

Question

I am trying to build a scraper, but I keep getting the 503 blocking error. I can still access the website manually, so my IP address hasn't been blocked. I keep switching user agents and still can't get my code to run all the way through. Sometimes it gets through as many as 15 requests, sometimes none at all, but it always fails eventually. I have no doubt that I'm doing something wrong in my code. I did shave it down to fit, though, so please keep that in mind. How do I fix this without using third parties?

import requests
import urllib2
from urllib2 import urlopen     
import random
from contextlib import closing
from bs4 import BeautifulSoup
import ssl
import parser
import time
from time import sleep

def Parser(urls):
    randomint = random.randint(0, 2)
    randomtime = random.randint(5, 30)

    url = "https://www.website.com"   
    user_agents = [
        "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0)",
        "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0)",
        "Opera/9.80 (Windows NT 6.1; U; cs) Presto/2.2.15 Version/10.00"
    ]
    index = 0
    opener = urllib2.build_opener()
    req = opener.addheaders = [('User-agent', user_agents[randomint])]

def ReadUPC():
    UPCList = [
        'upc',
        'upc2',
        'upc3',
        'upc4',
        'etc.'
    ]

    extracted_data = []
    for i in UPCList:
        urls = "https://www.website.com" + i
        randomtime = random.randint(5, 30)
        Soup = BeautifulSoup(urlopen(urls), "lxml")
        price = Soup.find("span", { "class": "a-size-base a-color-price s-price a-text-bold"})
        sleep(randomtime)

        randomt = random.randint(5, 15)
        print "ref url:", urls
        sleep(randomt)
        print "Our price:",price
        sleep(randomtime)

if __name__ == "__main__":
    ReadUPC()
    index = index + 1     

sleep(10)



    class HTTPDefaultErrorHandler(BaseHandler):
        def http_error_default(self, req, fp, code, msg, hdrs):
            raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)

    class HTTPRedirectHandler(BaseHandler):

HTTPError: HTTP Error 503: Service Unavailable

Marcus Müller · Accepted Answer

Behave like a normal human being using a browser. That website seems to be designed to analyze your behaviour; it recognizes that you're a scraper and wants to block you. In the simplest case, a bit of JavaScript that changes link URLs on the fly would be enough to defeat "dumb" scrapers.

There are elegant ways to solve this dilemma, for example by instrumenting a browser, but that won't happen without external tools.
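Short of instrumenting a browser, a standard-library-only first step is to make each request look a little more like a browser's and to slow down when the server answers 503. Note that in the question's code the opener carrying the custom User-Agent is never used: `urlopen(urls)` goes through the default opener, so the requests most likely carry urllib's default User-Agent. The sketch below is only an illustration, written with the Python 3 module names (`urllib.request` instead of `urllib2`); the header values, the function names `build_request` and `fetch_with_backoff`, and the delays are all assumptions, and there is no guarantee this is enough for a site that analyzes behaviour more deeply.

```python
import random
import time
import urllib.error
import urllib.request

# Hypothetical header set for illustration; copy real values from your own browser.
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) "
                  "Gecko/20100101 Firefox/115.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
}


def build_request(url, headers=BROWSER_HEADERS):
    """Attach the headers to the request object itself, so they are actually sent."""
    return urllib.request.Request(url, headers=headers)


def fetch_with_backoff(url, opener=urllib.request.urlopen, sleep=time.sleep,
                       max_tries=4, base_delay=5.0):
    """Fetch url, waiting longer after each 503 before trying again."""
    for attempt in range(max_tries):
        try:
            with opener(build_request(url)) as response:
                return response.read()
        except urllib.error.HTTPError as err:
            # Re-raise anything that is not a 503, or a 503 on the last attempt.
            if err.code != 503 or attempt == max_tries - 1:
                raise
            # Exponential backoff with up to 1 s of jitter: ~5 s, ~10 s, ~20 s.
            sleep(base_delay * 2 ** attempt + random.uniform(0, 1))
```

In the question's loop, `BeautifulSoup(urlopen(urls), "lxml")` would become `BeautifulSoup(fetch_with_backoff(urls), "lxml")`. The `opener` and `sleep` parameters exist only so the function can be exercised without touching the network.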
