Reputation:
I would like to get the title of this Amazon product using BeautifulSoup and requests. When I run the script, it says:
Traceback (most recent call last):
  File "scraper.py", line 15, in <module>
    title = soup.find('span', id='productTitle').get_text()
AttributeError: 'NoneType' object has no attribute 'get_text'
Please help me.
import bs4
import requests
from bs4 import BeautifulSoup
from urllib.request import urlopen
url = 'https://www.amazon.de/OnePlus-Smartphone-Almond-Display-Speicher/dp/B07RWL3K1Y/ref=sr_1_2? __mk_de_DE=%C3%85M%C3%85%C5%BD%C3%95%C3%91&dchild=1&keywords=oneplus+7+pro&qid=1598088298&sr=8-2'
headers = {
    "User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.135 Safari/537.36'
}
page = requests.get(url, headers = headers)
soup = BeautifulSoup(page.content, 'html.parser')
title = soup.find('span', id='productTitle').get_text()
print(title)
Upvotes: 0
Views: 199
Reputation: 1794
The issue is the use of 'html.parser' as your bs4 parser. Try lxml instead, which handles broken HTML more gracefully. The error is telling you that soup.find() returned None because it never found <span id='productTitle'>. We can see the element is in the page, so this is probably a parsing failure related to non-standard HTML.
import requests
from bs4 import BeautifulSoup
url = 'https://www.amazon.de/OnePlus-Smartphone-Almond-Display-Speicher/dp/B07RWL3K1Y/ref=sr_1_2? __mk_de_DE=%C3%85M%C3%85%C5%BD%C3%95%C3%91&dchild=1&keywords=oneplus+7+pro&qid=1598088298&sr=8-2'
headers = {
    "User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.135 Safari/537.36'
}
page = requests.get(url, headers = headers)
soup = BeautifulSoup(page.content, 'lxml')
title = soup.find('span', id='productTitle').get_text().strip()
print(title)
Output:
OnePlus 7 Pro Smartphone Almond (16,9 cm) AMOLED Display 8 GB RAM + 256 GB Speicher, Triple Kamera (48 MP) Pop-up Kamera (16 MP) – Dual SIM Handy
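Since soup.find() returns None whenever the element is missing, it may also be worth guarding the lookup so the script fails with a clear message instead of an AttributeError. The following is only a minimal sketch: the helper name extract_title and the two HTML snippets are made up for illustration and are not real Amazon markup. The demo calls pass 'html.parser' purely so the snippet runs without lxml installed; for the real page you would keep the 'lxml' default.

```python
from bs4 import BeautifulSoup


def extract_title(html, parser="lxml"):
    """Return the stripped text of <span id="productTitle">, or None if absent.

    Hypothetical helper for illustration; not part of bs4 or requests.
    """
    soup = BeautifulSoup(html, parser)
    tag = soup.find("span", id="productTitle")
    if tag is None:
        # find() found no match, so there is nothing to call get_text() on
        return None
    return tag.get_text().strip()


# Stand-in HTML snippets (assumed for the demo, not fetched from Amazon)
with_title = '<div><span id="productTitle">  OnePlus 7 Pro  </span></div>'
without_title = '<div><p>No product here</p></div>'

print(extract_title(with_title, parser="html.parser"))     # OnePlus 7 Pro
print(extract_title(without_title, parser="html.parser"))  # None
```

In the original script you would pass page.content as the html argument and print a message of your choice when the result is None.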
Upvotes: 1