Reputation: 2943
I have written a very simple script to scrape a small amount of data from a website (100 entries max) for personal use (just so I can make a quick comparison).
But instead of the requested page, I receive a different page saying that they think I am not a real user (which is true). How do I circumvent this? If I open the URL from the code in a new incognito window, it loads fine.
Am I missing some specific headers?
Or do I need to do something different?
This is the code I have so far:
import requests
from lxml import etree
import mysql.connector
from mysql.connector import Error
headers = {
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
'Accept-Encoding': 'gzip, deflate, br',
'Accept-Language': 'en-GB,en-US;q=0.9,en;q=0.8,la;q=0.7',
'Referer': 'https://www.google.com/',
'sec-ch-ua': '"Google Chrome";v="87", " Not;A Brand";v="99", "Chromium";v="87"',
'sec-ch-ua-mobile': '?0',
'sec-fetch-dest': 'document',
'sec-fetch-mode': 'navigate',
'sec-fetch-site': 'cross-site',
'sec-fetch-user': '?1',
'upgrade-insecure-requests': '1',
}
base_request_url = 'https://www.funda.nl/koop/gemeente-eindhoven/verkocht/200000-350000/sorteer-postcode-af/p'
request_page_id = 1
request_url = base_request_url + str(request_page_id)
res = requests.get(request_url, headers = headers)
print(res.text)
Upvotes: 0
Views: 1226
Reputation: 1817
You can change your headers to:
headers = {'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Mobile Safari/537.36'}
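A minimal sketch of how that single header could be used with the URL from the question, assuming `requests` is installed. The helper names (`build_session`, `page_url`, `fetch_pages`) and the two-second delay are illustrative choices, not part of the original code, and there is no guarantee the site accepts these requests.

```python
import time

import requests

# User agent from the answer above.
MOBILE_UA = ('Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) '
             'AppleWebKit/537.36 (KHTML, like Gecko) '
             'Chrome/87.0.4280.88 Mobile Safari/537.36')

# Base URL from the question; the page number is appended to it.
BASE_URL = ('https://www.funda.nl/koop/gemeente-eindhoven/verkocht/'
            '200000-350000/sorteer-postcode-af/p')


def build_session(user_agent=MOBILE_UA):
    """Return a Session that sends the given User-Agent on every request."""
    session = requests.Session()
    session.headers.update({'User-Agent': user_agent})
    return session


def page_url(page_id, base_url=BASE_URL):
    """Build the URL for one result page, as the question's code does."""
    return base_url + str(page_id)


def fetch_pages(session, page_ids, delay_seconds=2.0):
    """Fetch each page in turn, pausing between requests (hypothetical helper)."""
    pages = {}
    for page_id in page_ids:
        res = session.get(page_url(page_id), timeout=10)
        res.raise_for_status()  # fail loudly on 4xx/5xx responses
        pages[page_id] = res.text
        time.sleep(delay_seconds)
    return pages
```

Calling `fetch_pages(build_session(), [1])` would then return a dict mapping page number 1 to its HTML. A session also keeps any cookies the site sets between requests, which a bare `requests.get` does not.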
And here is an alternative using Selenium in headless mode:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
chrome_options = Options()
# Set a browser-like user agent to reduce the chance of bot detection.
user_agent = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.50 Safari/537.36'
# Run Chrome in headless mode (no visible window).
chrome_options.add_argument("--headless")
# Apply the user agent.
chrome_options.add_argument(f'user-agent={user_agent}')
DRIVER_PATH = "path/to/chromedriver"
# Selenium 4 expects the driver path to be wrapped in a Service object.
driver = webdriver.Chrome(service=Service(DRIVER_PATH), options=chrome_options)
driver.get("https://www.funda.nl/koop/gemeente-eindhoven/verkocht/200000-350000/sorteer-postcode-af/p")
page_source = driver.page_source
# Close the browser once the page source has been read.
driver.quit()
Upvotes: 1
Reputation: 662
Instead of bypassing the bot check, it may be enough in your case to simply download the page(s) in your browser and save them as HTML files. Your script can then read the saved file instead of requesting (base_)request_url.
(Sorry, I don't have enough reputation yet to write comments.)
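As a rough sketch of that approach, the snippet below reads a page saved from the browser and collects the link targets in it. It uses only the standard library; `LinkCollector` and `links_from_saved_page` are made-up names for illustration, and which elements you actually need depends on the page's markup, which is not shown here. The `lxml` import from the question could be used for the parsing step instead.

```python
from html.parser import HTMLParser
from pathlib import Path


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag that has one."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href')
            if href:
                self.links.append(href)


def links_from_saved_page(path):
    """Parse a locally saved HTML file and return its link targets in order."""
    html_text = Path(path).read_text(encoding='utf-8')
    collector = LinkCollector()
    collector.feed(html_text)
    return collector.links
```

For example, after saving the first result page as `page1.html` (a file name chosen here for illustration), `links_from_saved_page('page1.html')` returns the list of links found on it.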
Upvotes: 1
Reputation: 9543
Can you copy the headers from the actual browser you used before? In Firefox you can find them in the Network tab of the developer tools; switching on the 'Raw' toggle makes them easier to copy.
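Once the raw header text is copied, it still has to be turned into the dict that `requests` expects. A small sketch of that conversion follows; `parse_raw_headers` is a hypothetical helper, and it assumes the copied text has one `Name: value` pair per line, possibly preceded by a request line.

```python
def parse_raw_headers(raw):
    """Turn a block of 'Name: value' lines into a dict usable as headers=."""
    headers = {}
    for line in raw.splitlines():
        line = line.strip()
        # Skip blank lines, the request line (it has no colon) and
        # HTTP/2 pseudo-headers such as ':authority'.
        if not line or ':' not in line or line.startswith(':'):
            continue
        name, _, value = line.partition(':')
        headers[name.strip()] = value.strip()
    return headers
```

The result can be passed straight to the call from the question, for example `requests.get(request_url, headers=parse_raw_headers(raw_text))`, where `raw_text` holds the text pasted from the browser.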
Upvotes: 0