Snow

Reputation: 1138

Scraping links with BeautifulSoup from all pages in Amazon results in error

I'm trying to scrape product URLs from the Amazon Webshop, by going through every page.

import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0",
    "Accept-Encoding": "gzip, deflate",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "DNT": "1",
    "Connection": "close",
    "Upgrade-Insecure-Requests": "1",
}

products = set()
for i in range(1, 21):
    url = 'https://www.amazon.fr/s?k=phone%2Bcase&page=' + str(i)
    response = requests.get(url, headers=headers)

    soup = BeautifulSoup(response.content, 'html.parser')

    print(soup) # prints the HTML content saying Error on Amazon's side

    links = soup.select('a.a-link-normal.a-text-normal')

    for tag in links:
        url_product = 'https://www.amazon.fr' + tag.attrs['href']
        products.add(url_product)

Instead of getting the content of the page, I get a "Sorry, something went wrong on our end" HTML Error Page. What is the reason behind this? How can I successfully bypass this error and scrape the products?

Upvotes: 1

Views: 2011

Answers (1)

Amazon does not allow automated access to its data. You can confirm this by checking r.status_code: a blocked request returns an error page containing this message:

To discuss automated access to Amazon data please contact [email protected]

Therefore, you can either use the Amazon API, or rotate through a pool of proxies, passing one to each GET request via the proxies argument.

Here's a set of headers that Amazon accepts without blocking the request:
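Since requests takes its proxies as a dict mapping scheme to proxy URL, rotating a pool just means picking one entry per request. A minimal sketch, assuming you already have a list of working proxy URLs (the addresses below are placeholders, not real proxies):

```python
import random

def pick_proxy(proxy_list):
    """Choose one proxy at random and build the mapping requests expects."""
    proxy = random.choice(proxy_list)
    return {'http': proxy, 'https': proxy}

# Placeholder addresses -- replace with real, working proxies.
proxy_list = ['http://10.0.0.1:8080', 'http://10.0.0.2:3128']

proxies = pick_proxy(proxy_list)
# r = requests.get(url, headers=headers, proxies=proxies)
```

Each call to pick_proxy returns a fresh mapping, so calling it once per page spreads the requests across the pool.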

import requests
from bs4 import BeautifulSoup

headers = {
    'Host': 'www.amazon.fr',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip, deflate, br',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
    'TE': 'Trailers'
}

for page in range(1, 21):
    r = requests.get(
        f'https://www.amazon.fr/s?k=phone+case&page={page}&ref=sr_pg_{page}', headers=headers)
    soup = BeautifulSoup(r.text, 'html.parser')
    for link in soup.findAll('a', attrs={'class': 'a-link-normal a-text-normal'}):
        print(f"https://www.amazon.fr{link.get('href')}")
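Even with good headers, Amazon can still intermittently serve the "Sorry, something went wrong on our end" page, so it helps to detect it and back off instead of parsing an empty result. A minimal sketch; is_blocked and fetch_page are illustrative names, and the get callable is injected (e.g. requests.Session().get) so the logic can be exercised without network access:

```python
import time

def is_blocked(status_code, html):
    """Heuristic: a non-200 status or Amazon's known error text means we were blocked."""
    return status_code != 200 or 'Sorry, something went wrong' in html

def fetch_page(get, url, retries=3, delay=5.0):
    """Fetch url, backing off and retrying whenever the block page comes back."""
    for attempt in range(retries):
        r = get(url)
        if not is_blocked(r.status_code, r.text):
            return r.text
        time.sleep(delay * (attempt + 1))  # linear back-off before retrying
    return None  # still blocked after all retries
```

Passing delay=0 in tests keeps the sketch fast; in real use a delay of several seconds per attempt is more polite and less likely to keep tripping the block.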


Upvotes: 3
