Tanaka Chitete

Reputation: 1

Can't web scrape using Python and Beautiful Soup

I am trying to do some web scraping (for the Automate the Boring Stuff with Python Udemy course) but I keep getting an HTTPError: 403 Client Error: HTTP Forbidden for url: error. Here is the code I have been working with:

import bs4
import requests
ro = requests.get('https://www.amazon.com/Automate-Boring-Stuff-Python-Programming/dp/1593275994/')
ro.raise_for_status()

And here's the error message I have been getting:

Traceback (most recent call last):
  File "<pyshell#3>", line 1, in <module>
    ro.raise_for_status()
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/requests/models.py", line 940, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 403 Client Error: HTTP Forbidden for url: https://www.amazon.com/Automate-Boring-Stuff-Python-Programming/dp/1593275994/

I have read online about changing the user agent, but I don't understand what that is or how to do it. Can anyone offer some help here? I am completely lost and I can't seem to find any web scraping information anywhere. I am on Mac if that helps at all. Thanks.

Upvotes: 0

Views: 345

Answers (2)

Stefan

Reputation: 3051

The requests package lets you set a custom user agent, which makes the server think the request is coming from a regular browser:

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0'}
ro = requests.get('https://www.amazon.com/Automate-Boring-Stuff-Python-Programming/dp/1593275994/', headers=headers)
ro.raise_for_status()

soup = BeautifulSoup(ro.text, 'html.parser')
print(soup.prettify())

Upvotes: 1

Meh

Reputation: 196

First, I would suggest replacing ro.raise_for_status() with checks on ro.status_code using if/elif statements. If you do want to keep ro.raise_for_status(), call it inside a try/except block instead. As for the error itself: Amazon appears to block requests that carry the requests module's default user agent. To get around this, change the user agent to something like: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36. For further information on implementing this, please check this page, Using Python Requests section.
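To make the two suggestions above concrete, here is a minimal sketch. The user-agent string and the describe_status/fetch helper names are illustrative, not part of any library; the branching shows the status_code approach, and fetch shows the try/except alternative:

```python
import requests

# Illustrative browser user-agent; any mainstream browser string works
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/35.0.1916.47 Safari/537.36'
}

def describe_status(status_code):
    # Branch on the status code instead of raising an exception
    if status_code == 200:
        return 'ok'
    elif status_code == 403:
        return 'forbidden - try sending a browser User-Agent header'
    elif status_code == 404:
        return 'not found'
    else:
        return 'unexpected status {}'.format(status_code)

def fetch(url):
    # Alternative: keep raise_for_status() but wrap it in try/except
    try:
        ro = requests.get(url, headers=HEADERS)
        ro.raise_for_status()
        return ro.text
    except requests.exceptions.HTTPError as err:
        print('Request failed: {}'.format(err))
        return None
```

With this, a 403 no longer crashes the script: fetch() returns None and prints the error, and describe_status(ro.status_code) gives you a readable summary to act on.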

P.S.: please make sure to check whether web scraping Amazon is permitted before doing it.

Upvotes: 0
