Reputation: 77
I'm trying to extract the title of a product from amazon.com using the id of the span that contains the title. This is what I wrote:
import requests
from bs4 import BeautifulSoup
url = 'https://www.amazon.com/Acer-SB220Q-Ultra-Thin-Frame-Monitor/dp/B07CVL2D2S/ref=lp_16225007011_1_7'
res = requests.get(url)
soup = BeautifulSoup(res.content, 'html.parser')
title = soup.find(id='productTitle').get_text()
print(title)
I keep getting either None or an empty list, or I can't extract anything and get an AttributeError saying that the object I used doesn't have an attribute get_text. That raises another question: how do I get the text of this simple span? I'd really appreciate it if someone could figure it out and help me. Thanks in advance.
Upvotes: 0
Views: 424
Reputation: 2159
Running your code and checking the value of res, you get a 503 response. This means the service is unavailable (HTTP status 503).
Following up on this SO post, it seems that adding headers={"User-Agent": "Defined"} to the requests.get call does work.
res = requests.get(url, headers={"User-Agent": "Defined"})
Will return a 200 (OK) response.
Amazon actually checks for web scrapers, and even though you will get a page back, printing the result (print(soup)) will likely show you the following:
<body>
<!--
To discuss automated access to Amazon data please contact [email protected].
For information about migrating to our APIs refer to our Marketplace APIs at https://developer.amazonservices.com/ref=rm_c_sv, or our Product Advertising API at https://affiliate-program.amazon.com/gp/advertising/api/detail/main.html/ref=rm_c_ac for advertising use cases.
-->
...
<h4>Enter the characters you see below</h4>
<p class="a-last">Sorry, we just need to make sure you're not a robot. For best results, please make sure your browser is accepting cookies.</p>
</div>
</div>
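This also explains the AttributeError from the question: the robot-check page has no element with id productTitle, so soup.find(id='productTitle') returns None, and None has no get_text. A minimal sketch of guarding against both cases follows. The helper names looks_like_robot_check and text_or_none are made up for illustration, and the marker phrases are simply the ones visible in the output above, so they may stop matching if Amazon changes the wording.

```python
# Marker phrases copied from the robot-check page shown above.
# Assumption: Amazon may change this wording at any time.
ROBOT_CHECK_MARKERS = (
    "Enter the characters you see below",
    "make sure you're not a robot",
    "To discuss automated access to Amazon data",
)


def looks_like_robot_check(html: str) -> bool:
    """Return True if the HTML contains any of the known robot-check phrases."""
    return any(marker in html for marker in ROBOT_CHECK_MARKERS)


def text_or_none(element):
    """Return the stripped text of an element, or None if find() found nothing.

    `element` is whatever soup.find(...) returned: a tag-like object with a
    get_text() method, or None when no matching tag exists.
    """
    if element is None:
        return None
    return element.get_text(strip=True)
```

With the code from the question, this could be used as looks_like_robot_check(res.text) right after the request, and title = text_or_none(soup.find(id='productTitle')) instead of calling get_text directly, so a blocked request gives you None rather than an exception.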
But you can use selenium to simulate a human. A minimal working example for me was the following:
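from selenium import webdriver
from selenium.webdriver.common.by import By

url = 'http://www.amazon.com/Acer-SB220Q-Ultra-Thin-Frame-Monitor/dp/B07CVL2D2S/ref=lp_16225007011_1_7'
driver = webdriver.Firefox()
driver.get(url)
# find_element_by_id was removed in Selenium 4; find_element(By.ID, ...) is the replacement
title = driver.find_element(By.ID, 'productTitle').text
print(title)
driver.quit()  # close the browser window when done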
Which prints out
Acer SB220Q bi 21.5 Inches Full HD (1920 x 1080) IPS Ultra-Thin Zero Frame Monitor (HDMI & VGA Port), Black
One caveat with selenium is that it is much slower than the requests library. Also, a new browser window will pop up showing the page, but luckily we can avoid that by using a headless driver.
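For the headless part, a rough sketch could look like the following. It assumes Selenium 4 with Firefox and geckodriver available on your machine; the function name fetch_title_headless and the HEADLESS_FLAG constant are just illustrative names, not part of any library. I have not checked whether Amazon treats a headless browser differently from a visible one.

```python
# Firefox's command-line flag for running without a visible window.
HEADLESS_FLAG = "-headless"


def fetch_title_headless(url, element_id="productTitle"):
    """Open `url` in headless Firefox and return the text of the element with `element_id`."""
    # Imported inside the function so this module can still be imported
    # on machines where selenium is not installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.firefox.options import Options

    options = Options()
    options.add_argument(HEADLESS_FLAG)

    driver = webdriver.Firefox(options=options)
    try:
        driver.get(url)
        return driver.find_element(By.ID, element_id).text
    finally:
        # Always shut the browser down, even if the element is missing.
        driver.quit()


if __name__ == "__main__":
    url = "http://www.amazon.com/Acer-SB220Q-Ultra-Thin-Frame-Monitor/dp/B07CVL2D2S/ref=lp_16225007011_1_7"
    print(fetch_title_headless(url))
```

The try/finally is there so the background Firefox process does not stay around if find_element raises because the page turned out to be the robot check.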
Upvotes: 2