Why do I keep getting None or empty lists when trying to scrape data with BeautifulSoup in Python?

I'm trying to extract the title of a product from amazon.com using the id of the span that contains the title. This is what I wrote:

import requests
from bs4 import BeautifulSoup

url = 'https://www.amazon.com/Acer-SB220Q-Ultra-Thin-Frame-Monitor/dp/B07CVL2D2S/ref=lp_16225007011_1_7'
res = requests.get(url)
soup = BeautifulSoup(res.content, 'html.parser')
title = soup.find(id='productTitle').get_text()
print(title)

I keep getting either None or an empty list, or I can't extract anything and get an AttributeError saying that the object I used has no attribute get_text. That raises another question: how do I get the text of this simple span? I would really appreciate it if someone could figure this out and help me. Thanks in advance.

Upvotes: 0

Answers (1)

Thymen

Reputation: 2159

Problem

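If you run your code and inspect res, you will see a 503 response, i.e. HTTP status 503 (Service Unavailable). The product page is never returned, so soup.find(id='productTitle') finds nothing and returns None, and calling .get_text() on None raises the AttributeError.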
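A minimal sketch of that check, assuming requests and bs4 are installed. The helper names extract_title and fetch_title are made up for illustration. The idea is to look at the status code before parsing and to guard against find() returning None:

```python
import requests
from bs4 import BeautifulSoup


def extract_title(html):
    """Return the text of the element with id='productTitle', or None if it is absent."""
    soup = BeautifulSoup(html, "html.parser")
    node = soup.find(id="productTitle")
    # find() returns None when nothing matches; calling .get_text() on None
    # is what raises the AttributeError from the question.
    if node is None:
        return None
    return node.get_text(strip=True)


def fetch_title(url, headers=None):
    """Hypothetical helper: fetch the page and parse it only if the request succeeded."""
    res = requests.get(url, headers=headers, timeout=10)
    print(res.status_code)  # inspect this first; anything other than 200 is not the product page
    if res.status_code != 200:
        return None
    return extract_title(res.text)
```

Calling extract_title on HTML that lacks the span returns None instead of raising, which makes it easier to see that the problem is the response and not the parsing.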

Solution

Following this SO post, adding headers={"User-Agent": "Defined"} to the GET request seems to work:

res = requests.get(url, headers={"User-Agent": "Defined"})

This returns a 200 (OK) response.

The Twist

Amazon actually checks for web scrapers, and even though you will get a page back, printing the result (print(soup)) will likely show you the following:

<body>
<!--
        To discuss automated access to Amazon data please contact [email protected].
        For information about migrating to our APIs refer to our Marketplace APIs at https://developer.amazonservices.com/ref=rm_c_sv, or our Product Advertising API at https://affiliate-program.amazon.com/gp/advertising/api/detail/main.html/ref=rm_c_ac for advertising use cases.
-->

...

<h4>Enter the characters you see below</h4>
<p class="a-last">Sorry, we just need to make sure you're not a robot. For best results, please make sure your browser is accepting cookies.</p>
</div>
</div>
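A rough way to detect this situation in code is to look for text from the blocked page before trying to parse a title. This is only a sketch: the function name is made up, and the marker strings are copied from the output above, so they are an assumption that may stop matching if Amazon changes the page.

```python
# Marker strings taken from the blocked page shown above (assumed, may change).
ROBOT_CHECK_MARKERS = (
    "Enter the characters you see below",
    "To discuss automated access to Amazon data",
)


def looks_like_robot_check(html):
    """Return True if the HTML contains any of the known robot-check markers."""
    return any(marker in html for marker in ROBOT_CHECK_MARKERS)
```

For example, calling looks_like_robot_check(res.text) right after the request tells you whether a 200 response is the real product page or the CAPTCHA page.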

The counter

However, you can use Selenium to drive a real browser, which looks much more like a human visitor. A minimal working example for me was the following:

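from selenium import webdriver
from selenium.webdriver.common.by import By

url = 'http://www.amazon.com/Acer-SB220Q-Ultra-Thin-Frame-Monitor/dp/B07CVL2D2S/ref=lp_16225007011_1_7'

driver = webdriver.Firefox()  # opens a visible Firefox window
driver.get(url)
# find_element_by_id was removed in Selenium 4.3; find_element(By.ID, ...) is the equivalent.
title = driver.find_element(By.ID, 'productTitle').text
print(title)
driver.quit()  # close the browser when done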

Which prints out

Acer SB220Q bi 21.5 Inches Full HD (1920 x 1080) IPS Ultra-Thin Zero Frame Monitor (HDMI & VGA Port), Black

One drawback of Selenium is that it is much slower than the requests library. It also opens a browser window that shows the page, but you can avoid that window by using a headless driver.

Upvotes: 2