DiamondJoe12

Reputation: 1833

Response 403 with Selenium web scraper - how to fix?

I have a simple web scraper (using Selenium in headless Chrome, on Ubuntu) that iterates through some pages to collect information:

import requests
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.common.exceptions import TimeoutException

#set driver options
chrome_options = Options()
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--window-size=1420,1080')
chrome_options.add_argument('--headless')
chrome_options.add_argument("--disable-features=VizDisplayCompositor")
chrome_options.add_argument('--disable-dev-shm-usage')
chrome_options.add_argument('--disable-gpu')
chrome_options.add_argument("--disable-notifications")
chrome_options.add_argument("--remote-debugging-port=9222")
chrome_options.add_experimental_option("excludeSwitches", ["enable-automation"])
chrome_options.add_experimental_option('useAutomationExtension', False)
chrome_options.binary_location = '/usr/bin/google-chrome-stable'
chrome_driver_binary = "/usr/bin/chromedriver"
driver = webdriver.Chrome(executable_path=chrome_driver_binary, chrome_options=chrome_options)

#Set base url 
base_url = 'www.example.com&page='


events = []
eventContainerBucket = []

for i in range(1,30):

    #cycle through pages in range
    driver.get(base_url + str(i))
    pageURL = base_url + str(i)

    # get events links
    event_list = driver.find_elements_by_css_selector('div[class^=_1abc] a[class^=_1xyz]')
    # collect href attribute of events in event_list
    events.extend(event.get_attribute("href") for event in event_list)

print("total events: ", (len(events)))

#GET request user-agent
headers = {'User-Agent': "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36"}


# iterate through all events and open them.
item = {}
allEvents = []
for event in events:

    try:
        driver.get(event)
        currentUrl = driver.current_url
        print(currentUrl)
    except TimeoutException as ex:
        print(ex.msg)
        driver.refresh()


    try:
        currentRequest = requests.get(currentUrl, headers=headers)
        print (currentRequest)

        #print currentRequest.status_code
    except requests.exceptions.RequestException as e:
        print(e)
        continue

My Issue:

Everything was working fine until yesterday, when I started getting a 403 error. Typically, the script will iterate through about 20-30 URLs without a problem, but then it gives me a 403 response.

What I've tried:

Tried changing the requests header to:

headers = {'User-Agent': 'Mozilla/5.0'}

Still getting a 403. Do I need to add a wait time to the driver?

Upvotes: 0

Views: 3174

Answers (1)

isopach

Reputation: 1938

A 403 means that your request has been refused by the server. While it is impossible to guess exactly what the problem is without access to the actual website, I suggest making the request look as human-like as possible.

You'd want to make sure the headers sent by headless Selenium match the ones your regular browser sends (automatically) when visiting the site. Follow these steps:

  1. Access the website from your browser manually
  2. Inspect the network requests: in Chrome, press F12 or Ctrl+Shift+I, select the Network tab, then browse/reload the page you want to access
  3. Copy the request to pageURL as a cURL command ("Copy as cURL"), then extract the -H headers
  4. Put these headers in your code, for example:
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36',
    'custom-header': 'custom value',
    'cookie': '__cf_bm=some_random_value;'
}
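
Then pass that dict to requests in the existing loop. A minimal sketch, assuming the headers dict above; currentUrl here is a placeholder standing in for the driver.current_url value from the question's loop:

import requests

currentUrl = "https://www.example.com/some-event"  # placeholder; in the question this comes from driver.current_url
currentRequest = requests.get(currentUrl, headers=headers)  # 'headers' is the dict built from the copied browser request
print(currentRequest.status_code)  # expect 200 instead of 403 once the headers are accepted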

It is also possible that your IP address has been blocked, in which case you should try a proxy, as follows:

PROXY = "1.111.111.1:8080" #your proxy

chrome_options = WebDriverWait.ChromeOptions()
chrome_options.add_argument('--proxy-server=%s' % PROXY)
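
For completeness, a minimal sketch of starting the driver with that proxy applied, reusing the chromedriver path from the question (the proxy address is a placeholder):

from selenium import webdriver

PROXY = "1.111.111.1:8080"  # placeholder proxy address

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--proxy-server=%s' % PROXY)

driver = webdriver.Chrome(executable_path="/usr/bin/chromedriver", chrome_options=chrome_options)
driver.get("https://www.example.com")  # traffic now goes out through the proxy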

Upvotes: 1
