Reputation: 1138
I'm trying to scrape product URLs from the Amazon webshop by going through every results page.
import requests
from bs4 import BeautifulSoup

headers = {"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0", "Accept-Encoding":"gzip, deflate", "Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", "DNT":"1","Connection":"close", "Upgrade-Insecure-Requests":"1"}
products = set()
for i in range(1, 21):
    url = 'https://www.amazon.fr/s?k=phone%2Bcase&page=' + str(i)
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.content)
    print(soup) # prints the HTML content saying Error on Amazon's side
    links = soup.select('a.a-link-normal.a-text-normal')
    for tag in links:
        url_product = 'https://www.amazon.fr' + tag.attrs['href']
        products.add(url_product)
Instead of getting the content of the page, I get a "Sorry, something went wrong on our end" HTML Error Page. What is the reason behind this? How can I successfully bypass this error and scrape the products?
Upvotes: 1
Views: 2011
Reputation: 11515
Amazon does not allow automated access to its data. You can confirm that you are being blocked by checking the response status via r.status_code. A blocked request can also come back with an error page containing this message:
To discuss automated access to Amazon data please contact [email protected]
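To make that check concrete, here is a minimal sketch of a helper that flags a blocked response. The name looks_blocked and the marker list are my own assumptions; the two phrases are simply the messages quoted in this thread, and Amazon may serve other error pages that this does not catch.

```python
# Phrases taken from the error pages quoted in this thread. The list is an
# assumption and may not cover every block page the site can return.
BLOCK_MARKERS = (
    "Sorry, something went wrong on our end",
    "To discuss automated access to Amazon data",
)


def looks_blocked(status_code, html):
    """Return True if the response looks like an error or anti-bot page."""
    # Anything other than 200 is treated as a failed fetch.
    if status_code != 200:
        return True
    # A 200 response can still carry an error page, so inspect the body too.
    return any(marker in html for marker in BLOCK_MARKERS)


# Usage with requests (not executed here):
#   r = requests.get(url, headers=headers, timeout=10)
#   if looks_blocked(r.status_code, r.text):
#       print("blocked:", r.status_code)
```

Checking the body as well as the status code matters because the error page described in the question may not always arrive with an error status.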
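You can therefore either use the Amazon API, or send the GET request through proxies via the proxies argument of requests.get. Note that requests expects a dict mapping each URL scheme to a proxy URL there, so a list of proxies has to be passed one proxy at a time.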
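As a rough sketch of the proxy option, the snippet below rotates through a list of proxies and builds the dict that requests accepts. The proxy addresses are placeholders from a documentation-only IP range, and the helper names proxy_config and next_proxies are hypothetical, not part of requests.

```python
from itertools import cycle

# Placeholder addresses from the documentation-only 203.0.113.0/24 range.
# Replace them with proxies you are actually allowed to use.
PROXY_URLS = [
    "http://203.0.113.10:8080",
    "http://203.0.113.11:8080",
]


def proxy_config(proxy_url):
    """Build the scheme-to-proxy mapping that requests accepts as `proxies`."""
    return {"http": proxy_url, "https": proxy_url}


# Endless round-robin iterator over the configured proxies.
proxy_pool = cycle(PROXY_URLS)


def next_proxies():
    """Return the proxies dict for the next proxy in the rotation."""
    return proxy_config(next(proxy_pool))


# Usage with requests (not executed here):
#   r = requests.get(url, headers=headers, proxies=next_proxies(), timeout=10)
```

Calling next_proxies() once per page request spreads the requests across the list; whether that is enough to avoid blocking depends on the proxies themselves.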
Here is a way to pass headers to Amazon without getting blocked, and it works:
import requests
from bs4 import BeautifulSoup

headers = {
    'Host': 'www.amazon.fr',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    # Decoding 'br' responses requires the brotli (or brotlicffi) package
    'Accept-Encoding': 'gzip, deflate, br',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
    'TE': 'Trailers'
}

for page in range(1, 21):
    # f-string, so the page number is actually substituted into the URL
    r = requests.get(
        f'https://www.amazon.fr/s?k=phone+case&page={page}&ref=sr_pg_{page}',
        headers=headers)
    soup = BeautifulSoup(r.text, 'html.parser')
    # Separate loop variable so the page counter is not shadowed
    for link in soup.find_all('a', attrs={'class': 'a-link-normal a-text-normal'}):
        print(f"https://www.amazon.fr{link.get('href')}")
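Two details of the loop can be handled with the standard library instead of string concatenation. The question's URL contains %2B, which is the encoding of a literal plus sign rather than a space, and an href may be missing or already absolute. The following is only a sketch under those assumptions; search_url and absolute_urls are hypothetical helper names, and the query parameters are the ones used in the URL above.

```python
from urllib.parse import urlencode, urljoin

BASE_URL = "https://www.amazon.fr"


def search_url(query, page):
    """Build a search URL; urlencode turns the space in the query into '+'."""
    params = {"k": query, "page": page, "ref": f"sr_pg_{page}"}
    return f"{BASE_URL}/s?{urlencode(params)}"


def absolute_urls(hrefs):
    """Resolve hrefs against BASE_URL, dropping missing values and duplicates."""
    seen = []
    for href in hrefs:
        # tag.get('href') returns None when the attribute is absent.
        if not href:
            continue
        # urljoin leaves absolute URLs alone and resolves relative ones.
        url = urljoin(BASE_URL, href)
        if url not in seen:
            seen.append(url)
    return seen
```

In the loop above, the hrefs would come from link.get('href') for each matching tag, and the returned list keeps the order in which products first appeared.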
Upvotes: 3