DiamondJoe12

Reputation: 1833

Response 403 with Selenium web scraper - how to fix?

I have a simple web scraper (using Selenium in headless Chrome, on Ubuntu) that iterates through some pages to collect information:

import requests
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.common.exceptions import TimeoutException

#set driver options
chrome_options = Options()
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--window-size=1420,1080')
chrome_options.add_argument('--headless')
chrome_options.add_argument("--disable-features=VizDisplayCompositor")
chrome_options.add_argument('--disable-dev-shm-usage')
chrome_options.add_argument('--disable-gpu')
chrome_options.add_argument("--disable-notifications")
chrome_options.add_argument("--remote-debugging-port=9222")
chrome_options.add_experimental_option("excludeSwitches", ["enable-automation"])
chrome_options.add_experimental_option('useAutomationExtension', False)
chrome_options.binary_location = '/usr/bin/google-chrome-stable'
chrome_driver_binary = "/usr/bin/chromedriver"
driver = webdriver.Chrome(executable_path=chrome_driver_binary, chrome_options=chrome_options)

#Set base url 
base_url = 'www.example.com&page='


events = []
eventContainerBucket = []

for i in range(1,30):

    #cycle through pages in range
    driver.get(base_url + str(i))
    pageURL = base_url + str(i)

    # get events links
    event_list = driver.find_elements_by_css_selector('div[class^=_1abc] a[class^=_1xyz]')
    # collect href attribute of events in event_list
    events.extend(event.get_attribute("href") for event in event_list)

print("total events: ", (len(events)))

#GET request user-agent
headers = {'User-Agent': "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36"}


# iterate through all events and open them.
item = {}
allEvents = []
for event in events:

    try:
        driver.get(event)
        currentUrl = driver.current_url
        print(currentUrl)
    except TimeoutException as ex:
        print(ex.msg)
        driver.refresh()


    try:
        currentRequest = requests.get(currentUrl, headers=headers)
        print (currentRequest)

        #print currentRequest.status_code
    except requests.exceptions.RequestException as e:
        print(e)
        continue

My Issue:

Everything was working fine until yesterday, when I started getting a 403 error. Typically, the script will iterate through about 20-30 URLs without a problem, but then it gives me a 403 response.

What I've tried:

Tried changing the requests header to:

headers = {'User-Agent': 'Mozilla/5.0'}

Still getting a 403. Do I need to add a wait time to the driver?

Upvotes: 0

Views: 3174

Answers (1)

isopach

Reputation: 1938

A 403 means that your request has been refused by the server. While it is impossible to guess exactly what the problem is without access to the actual website, I suggest making the request look as human-like as possible.

You'd want to make sure the headers sent by headless Selenium match the ones your regular browser sends (automatically) when visiting the site. Follow these steps:

  1. Access the website from your browser manually
  2. Inspect the network requests: in Chrome, press F12 or Ctrl+Shift+I, select the Network tab, then browse/reload the page you want to access
  3. Copy the request to pageURL as a cURL command ("Copy as cURL"), then extract the -H headers
  4. Put these headers in your code, for example:
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36',
    'custom-header': 'custom value',
    'cookie': '__cf_bm=some_random_value;'
}
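
Then pass that dict to requests in the existing loop. A minimal sketch, assuming the headers dict above; currentUrl here is a placeholder standing in for the driver.current_url value from the question's loop:

import requests

currentUrl = "https://www.example.com/some-event"  # placeholder; in the question this comes from driver.current_url
currentRequest = requests.get(currentUrl, headers=headers)  # 'headers' is the dict built from the copied browser request
print(currentRequest.status_code)  # expect 200 instead of 403 once the headers are accepted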

It is also possible that your IP address has been blocked, in which case you should try a proxy, as follows:

PROXY = "1.111.111.1:8080" #your proxy

chrome_options = WebDriverWait.ChromeOptions()
chrome_options.add_argument('--proxy-server=%s' % PROXY)
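
For completeness, a minimal sketch of starting the driver with that proxy applied, reusing the chromedriver path from the question (the proxy address is a placeholder):

from selenium import webdriver

PROXY = "1.111.111.1:8080"  # placeholder proxy address

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--proxy-server=%s' % PROXY)

driver = webdriver.Chrome(executable_path="/usr/bin/chromedriver", chrome_options=chrome_options)
driver.get("https://www.example.com")  # traffic now goes out through the proxy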

Upvotes: 1
