Mr_12

Reputation: 23

Web-scraping with beautifulsoup returns empty list

Could you please help me get the names of the products?

You can see the path in the attached screenshot.

There, the product name is Samsung Galaxy... surrounded by other markup.

What I tried

import pandas as pd
import requests # Import the library for sending requests to the server
from bs4 import BeautifulSoup # Import the library for webpage parsing

URL='https://www.amazon.com/s?k=samsung+tablet&crid=3VMSMTMZYOP78&sprefix=samsung+%2Caps%2C273&ref=nb_sb_ss_ts-doa-p_2_8'
req = requests.get(URL) # GET-request

soup = BeautifulSoup(req.text, 'lxml')
soup.find_all('span', attrs={'class_':'a-size-medium a-color-base a-text-normal'})

Issues

I get an empty list. I do not understand why this is the case.

Upvotes: 0

Views: 823

Answers (2)

Arslan Aziz

Reputation: 213

Check this, it will work:

from bs4 import BeautifulSoup
import requests
import pandas as pd

products = []

# Browser-like request headers so Amazon serves the real results page
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0",
    "Accept-Encoding": "gzip, deflate",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "DNT": "1",
    "Connection": "close",
    "Upgrade-Insecure-Requests": "1",
}
cookies = {'session': '17ab96bd8ffbe8ca58a78657a918558'}

for page in range(1, 5):  # fetch result pages 1 to 4
    r = requests.get(
        "https://www.amazon.com/s?k=samsung+tablet&crid=3VMSMTMZYOP78&sprefix=samsung+%2Caps%2C273&ref=nb_sb_ss_ts-doa-p_2_8&page={page}".format(
            page=page
        ),
        headers=headers,
        cookies=cookies,
    )
    soup = BeautifulSoup(r.content, "lxml")
    # Each search result is an s-result-item; the product title sits in its <h2>
    for d in soup.select(".s-result-item[data-component-type='s-search-result']"):
        name = d.find('h2')
        if name is not None:
            products.append(name.text)
        else:
            products.append("-")

df = pd.DataFrame({'Product Name': products})
print(df)


output:

                                        Product Name
0   SAMSUNG Galaxy Tab S7 FE 2021 Android Tablet 1...
1   Samsung Galaxy Tab A7 10.4 Wi-Fi 32GB Silver (...
2   Samsung Galaxy Tab A7 10.4 Wi-Fi 32GB Silver (...
3   Samsung Tab A7 Lite 8.7" Gray 32GB (SM-T220NZA...
4   SAMSUNG Galaxy Tab A 8.0-inch Android Tablet 6...
..                                                ...
83  2020 Samsung Galaxy Tab A7 10.4” (2000x1200) ...
84  Samsung Galaxy Tab S6 Lite 10.4” Touchscreen ...
85  Samsung Galaxy Tab S6 Lite 10.4", 64GB Wi-Fi T...
86  SAMSUNG Galaxy Tab S7 11-inch Android Tablet 1...
87  SAMSUNG Galaxy S20 FE 5G Factory Unlocked Andr...
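
The browser-like headers, in particular the User-Agent, are what make this work: a bare requests.get() is answered with Amazon's robot-check page instead of the search results, so the product list comes back empty.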

Upvotes: 1

hc_dev

Reputation: 9377

You should probably adjust your request for scraping.

Test using curl

When I run a curl request against this URL, the returned HTML does not contain the expected <span class="a-size-medium a-color-base a-text-normal">.

curl 'https://www.amazon.com/s?k=samsung+tablet&crid=3VMSMTMZYOP78&sprefix=samsung+%2Caps%2C273&ref=nb_sb_ss_ts-doa-p_2_8' | grep "<span class="

It contains only the following spans:

                                    <span class="a-button a-button-primary a-span12">
                                        <span class="a-button-inner">
            <span class="a-letter-space"></span>
            <span class="a-letter-space"></span>
            <span class="a-letter-space"></span>
            <span class="a-letter-space"></span>

Test the soup

You can also test the soup as HedgeHog commented:

import requests # Import the library for sending requests to the server
from bs4 import BeautifulSoup # Import the library for webpage parsing

url = 'https://www.amazon.com/s?k=samsung+tablet&crid=3VMSMTMZYOP78&sprefix=samsung+%2Caps%2C273&ref=nb_sb_ss_ts-doa-p_2_8'
response = requests.get(url) # GET-request

soup = BeautifulSoup(response.text, 'html')  # adjusted from lxml to html
print(soup) # contains span elements, but not the expected ones

# Note: attrs expects the literal attribute name 'class';
# the 'class_' spelling is only valid as a keyword argument to find_all
elements = soup.find_all('span', attrs={'class': 'a-size-medium a-color-base a-text-normal'})
print(elements) # empty list, the expected spans were not found

You will discover a robot prevention, probably using a captcha to verify that a human is using a browser:

<h4>Enter the characters you see below</h4>
<p class="a-last">Sorry, we just need to make sure you're not a robot. For best results, please make sure your browser is accepting cookies.</p>
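
If you want your script to detect this case instead of just returning an empty list, a minimal sketch (the phrase check is a heuristic based on the block page above; Amazon may change the wording):

import requests

url = 'https://www.amazon.com/s?k=samsung+tablet&crid=3VMSMTMZYOP78&sprefix=samsung+%2Caps%2C273&ref=nb_sb_ss_ts-doa-p_2_8'
response = requests.get(url)

# Heuristic: the robot-check page contains this phrase (see the HTML above)
if "not a robot" in response.text:
    print("Robot check triggered; the product spans will not be in the response.")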

Fun fact: you can copy & paste or write the resulting HTML to a file and open it in your browser. It shows the guard dogs of Amazon: a dog picture that Amazon displays when something goes wrong.

See also All The Dogs You Can Meet If You're Trying To Get On Amazon But It's Broken

Workaround: passing required request-headers

Further research suggested adding two headers to the request (which your browser automatically adds, too):

  • a valid User-Agent (simulates a specific browser and OS/platform)
  • an Accept-Language (required by most e-commerce pages to localize content)

In requests you can add them as a dictionary:

HEADERS = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36',
    'Accept-Language': 'en-US, en;q=0.5'
}

response = requests.get(URL, headers=HEADERS)
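
To tie this back to the question, a minimal sketch combining these headers with the original parsing (no guarantee: Amazon may still answer with the robot check, and the class names may change over time):

import requests
from bs4 import BeautifulSoup

URL = 'https://www.amazon.com/s?k=samsung+tablet&crid=3VMSMTMZYOP78&sprefix=samsung+%2Caps%2C273&ref=nb_sb_ss_ts-doa-p_2_8'
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36',
    'Accept-Language': 'en-US, en;q=0.5'
}

response = requests.get(URL, headers=HEADERS)
soup = BeautifulSoup(response.text, 'lxml')

# use the literal attribute name 'class' as the attrs key
for span in soup.find_all('span', attrs={'class': 'a-size-medium a-color-base a-text-normal'}):
    print(span.get_text(strip=True))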


Upvotes: 1
