Reputation: 13
I am trying to scrape an Amazon Alexa Skill: https://www.amazon.com/PayPal/dp/B075764QCX/ref=sr_1_1?dchild=1&keywords=paypal&qid=1604026451&s=digital-skills&sr=1-1
For now, I am just trying to get the name of the skill (PayPal), but for some reason this returns an empty list. I have inspected the element in the browser, so I know the page should contain the name, and I am not sure what is going wrong. My code is below:
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

request = Request(skill_url, headers=request_headers)
response = urlopen(request)
response = response.read()
html = response.decode()
soup = BeautifulSoup(html, 'html.parser')
name = soup.find_all("h1", {"class": "a2s-title-content"})
Upvotes: 1
Views: 267
Reputation: 195573
Try setting the User-Agent and Accept-Language HTTP headers to prevent the server from sending you a Captcha page:
import requests
from bs4 import BeautifulSoup
headers = {
'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:82.0) Gecko/20100101 Firefox/82.0',
'Accept-Language': 'en-US,en;q=0.5'
}
url = 'https://www.amazon.com/PayPal/dp/B075764QCX/ref=sr_1_1?dchild=1&keywords=paypal&qid=1604026451&s=digital-skills&sr=1-1'
soup = BeautifulSoup(requests.get(url, headers=headers).content, 'lxml')
name = soup.find("h1", {"class" : "a2s-title-content"})
print(name.get_text(strip=True))
Prints:
PayPal
Upvotes: 0
Reputation: 1856
The page content is loaded with JavaScript, so you can't just use BeautifulSoup to scrape it. You have to use another module such as Selenium to simulate JavaScript execution.
Here is an example:
from bs4 import BeautifulSoup as soup
from selenium import webdriver

url = 'YOUR URL'

# launch Firefox (requires geckodriver on your PATH)
driver = webdriver.Firefox()
driver.get(url)

# grab the rendered HTML after JavaScript has run
page = driver.page_source

page_soup = soup(page, 'html.parser')
containers = page_soup.find_all("h1", {"class": "a2s-title-content"})
print(containers)
print(len(containers))
You can also use chrome-driver or edge-driver instead, as shown below.
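For example, here is a minimal sketch of the same scrape using the Chrome driver instead of Firefox (this assumes chromedriver is installed and available on your PATH; webdriver.Edge() works the same way with the Edge driver):
from bs4 import BeautifulSoup as soup
from selenium import webdriver

url = 'YOUR URL'

# Chrome instead of Firefox (webdriver.Edge() for the Edge driver)
driver = webdriver.Chrome()
driver.get(url)

# rendered HTML after JavaScript has run
page = driver.page_source
driver.quit()

page_soup = soup(page, 'html.parser')
containers = page_soup.find_all("h1", {"class": "a2s-title-content"})
print(containers)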
Upvotes: 1