Reputation: 1833
I have a simple web scraper (using Selenium with headless Chrome on Ubuntu) that iterates through some pages to collect information:
import requests
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.common.exceptions import TimeoutException

#set driver options
chrome_options = Options()
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--window-size=1420,1080')
chrome_options.add_argument('--headless')
chrome_options.add_argument('--disable-features=VizDisplayCompositor')
chrome_options.add_argument('--disable-dev-shm-usage')
chrome_options.add_argument('--disable-gpu')
chrome_options.add_argument('--disable-notifications')
chrome_options.add_argument('--remote-debugging-port=9222')
chrome_options.add_experimental_option('excludeSwitches', ['enable-automation'])
chrome_options.add_experimental_option('useAutomationExtension', False)
chrome_options.binary_location = '/usr/bin/google-chrome-stable'
chrome_driver_binary = '/usr/bin/chromedriver'
driver = webdriver.Chrome(executable_path=chrome_driver_binary, chrome_options=chrome_options)
#Set base url
base_url = 'https://www.example.com&page='
events = []
eventContainerBucket = []
for i in range(1,30):
    #cycle through pages in range
    pageURL = base_url + str(i)
    driver.get(pageURL)
    # get event links
    event_list = driver.find_elements_by_css_selector('div[class^=_1abc] a[class^=_1xyz]')
    # collect href attribute of events in event_list
    events.extend(event.get_attribute("href") for event in event_list)
print("total events: ", (len(events)))
#GET request user-agent
headers = {'User-Agent': "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36"}
# iterate through all events and open them.
item = {}
allEvents = []
for event in events:
    try:
        driver.get(event)
        currentUrl = driver.current_url
        print(currentUrl)
    except TimeoutException as ex:
        print(ex)
        driver.refresh()
    try:
        currentRequest = requests.get(currentUrl, headers=headers)
        print(currentRequest)
        #print(currentRequest.status_code)
    except requests.exceptions.RequestException as e:
        print(e)
        continue
My Issue:
Everything was working fine until yesterday, when I started getting a 403 error. Typically the script will iterate through about 20-30 URLs with no problem, but then it gives me a 403 response.
What I've tried:
I tried changing the requests header to:
headers = {'User-Agent': 'Mozilla/5.0'}
Still getting a 403. Do I need to add a wait time to the driver?
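For example, would throttling the loop with a randomized pause between page loads help, something along these lines?

import random
import time

for event in events:
    driver.get(event)
    # ... existing scraping logic ...
    time.sleep(random.uniform(2, 5))  # pause 2-5 seconds before the next request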
Upvotes: 0
Views: 3174
Reputation: 1938
403 means that your request has been refused by the server. While it is impossible to guess exactly what the problem is without access to the actual website, I suggest making the request look as human-like as possible.
You'd want to make sure the headers used in headless Selenium match the ones you (automatically) send when visiting the site in a regular browser. Follow these steps:

1. Open the pageURL in your browser with the DevTools Network tab open.
2. Copy the request as a cURL command, then extract the -H header values from it.
3. Pass those headers to requests, for example:

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36',
           'custom-header': 'custom value',
           'cookie': '__cf_bm=some_random_value;'
           }
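If the site sets session cookies (such as the __cf_bm value above), you can also copy them straight from the Selenium driver instead of hard-coding them, so requests presents itself as the same visitor. A minimal sketch, assuming driver, headers and currentUrl are the objects already defined in your script:

# Reuse the cookies headless Chrome already holds for the requests call
session = requests.Session()
session.headers.update(headers)

# driver.get_cookies() returns a list of dicts with 'name' and 'value' keys
for cookie in driver.get_cookies():
    session.cookies.set(cookie['name'], cookie['value'])

response = session.get(currentUrl)
print(response.status_code)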
There is also the possibility that your IP address has been blocked, in which case you should try a proxy, for example:
PROXY = "1.111.111.1:8080"  # your proxy
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--proxy-server=%s' % PROXY)
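Note that your script also fetches each page with requests, so the same proxy would need to be passed there too. A rough sketch (the proxy address above is a placeholder):

# Route the requests call through the same proxy as the browser
proxies = {
    'http': 'http://' + PROXY,
    'https': 'http://' + PROXY,
}
currentRequest = requests.get(currentUrl, headers=headers, proxies=proxies)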
Upvotes: 1