Reputation: 111
I'm trying to scrape data from an Autotrader page. I managed to grab a link to every offer on the page, but when I try to get the data from each offer I get a 403 status, even though I'm sending a header. What more can I do to get past it?
headers = {"User Agent": 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) '
'Chrome/85.0.4183.121 Safari/537.36'}
page = requests.get("https://www.autotrader.co.uk/car-details/202010145012219", headers=headers)
print(page.status_code) # 403 forbidden
content_of_page = page.content
soup = bs4.BeautifulSoup(content_of_page, 'lxml')
title = soup.find('h1', {'class': 'advert-heading__title atc-type-insignia atc-type-insignia--medium '})
print(title.text)
[For people in the same position: Autotrader uses Cloudflare to protect every "car-details" page, so I would suggest using Selenium, for example.]
Upvotes: 0
Views: 1303
Reputation: 1178
If you can get the data via your browser, i.e. you can see it on the website, then you can likely replicate that request with requests.
Briefly, you need the headers in your request to match the browser's request: open your browser's developer tools, find the request for the page in the Network tab, and copy its headers into your script. Your browser sends tons of extra headers, and you never know which ones the server actually checks, so copying all of them will save you a lot of time.
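A minimal sketch of that approach (the header values below are illustrative placeholders, not a guaranteed working set; paste the real ones from your own browser session):

    import requests

    # headers copied wholesale from the browser's Network tab --
    # the exact values here are examples, use your own
    headers = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-GB,en;q=0.9",
        "Referer": "https://www.autotrader.co.uk/",
        "Connection": "keep-alive",
    }

    page = requests.get(
        "https://www.autotrader.co.uk/car-details/202010145012219",
        headers=headers,
    )
    print(page.status_code)  # if the copied headers are enough, this should be 200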
However, this might fail if there's protection against blunt request copies, e.g. temporary tokens that prevent the requests from being reused. In that case you need Selenium (browser emulation/automation); it's not difficult to set up, so it's worth using.
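For the Selenium route, a minimal sketch (assuming Chrome and a matching chromedriver are installed; the class name is taken from the question and may change on the site):

    import bs4
    from selenium import webdriver

    # a real browser executes the Cloudflare JavaScript checks,
    # then we parse the rendered page with BeautifulSoup
    driver = webdriver.Chrome()
    try:
        driver.get("https://www.autotrader.co.uk/car-details/202010145012219")
        soup = bs4.BeautifulSoup(driver.page_source, "lxml")
        title = soup.find("h1", {"class": "advert-heading__title atc-type-insignia atc-type-insignia--medium "})
        print(title.text if title else "title not found")
    finally:
        driver.quit()

Cloudflare may still challenge automated browsers in some cases, but a real browser session can pass the JavaScript checks that plain requests cannot.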
Upvotes: 2