Reputation: 468
I am trying to scrape a site that links to Amazon, using Python with these frameworks: Selenium and BeautifulSoup.
My aim is to scrape the following Amazon product details: title, price, description, and first review.
But I am having a hard time with the BeautifulSoup selectors. I've tried many combinations, but I either get empty output or an error (I'm not a pro at this, unfortunately). The main problem is that BeautifulSoup doesn't have XPath selectors (AFAIK). Should I move to Scrapy for this task, or is Scrapy too overwhelming for such a simple scraper?
This is for the first product; I will iterate over the rest later:
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()  # start a browser session
driver.get('https://www.example.com')

# click through to the first product
first_article = driver.find_element_by_css_selector('div.button')
first_article.click()

# hand the rendered page over to BeautifulSoup
html = driver.page_source
soup = BeautifulSoup(html, 'lxml')
# perform the operation
After that I have to select the respective elements, but how do I do that? In XPath it would be something like this:
Title = '//h1[@id="title"]//text()'
Price = '//span[contains(@id,"ourprice") or contains(@id,"saleprice")]/text()'
Category = '//a[@class="a-link-normal a-color-tertiary"]//text()'
But I can't yet work out the path to the product details and the first review. I don't think BeautifulSoup's find_all selectors will be helpful here.
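For what it's worth, those XPath expressions can be exercised outside the browser with lxml (the same parser BeautifulSoup is using above). A minimal sketch against a hardcoded snippet that stands in for the real Amazon markup, whose actual ids and classes may differ:

```python
from lxml import html

# Hypothetical snippet standing in for the real Amazon product page.
sample = """
<html><body>
  <h1 id="title">Example Product</h1>
  <span id="priceblock_ourprice">$19.99</span>
  <a class="a-link-normal a-color-tertiary">Electronics</a>
</body></html>
"""

tree = html.fromstring(sample)
title = tree.xpath('//h1[@id="title"]//text()')
price = tree.xpath('//span[contains(@id,"ourprice") or contains(@id,"saleprice")]/text()')
category = tree.xpath('//a[@class="a-link-normal a-color-tertiary"]//text()')
```

Here `title`, `price`, and `category` each come back as a list of matching text nodes, so the same expressions could be run over `driver.page_source` instead of the sample string.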
Upvotes: 0
Views: 2859
Reputation: 70
Amazon has anti-scraping mechanisms in place: if it detects scraping, it serves a CAPTCHA to the scraper. So your issue is that it's returning the HTML for the CAPTCHA page, and you are not finding anything in it.
The only reliable way to scrape Amazon is to use a headless version of Selenium.
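One way to confirm this is the problem is to check whether the returned HTML is the block page rather than the product page. A minimal sketch, with the `/errors/validateCaptcha` form action being an assumption about what the block page contains, and both HTML strings being hypothetical stand-ins for real responses:

```python
from bs4 import BeautifulSoup

def looks_like_captcha(page_source):
    # Heuristic: the block page contains a CAPTCHA form.
    # The "/errors/validateCaptcha" action is an assumption.
    soup = BeautifulSoup(page_source, "html.parser")
    return soup.find("form", action="/errors/validateCaptcha") is not None

# Hypothetical samples of a blocked response and a normal product page
blocked = '<html><body><form action="/errors/validateCaptcha"></form></body></html>'
product = '<html><body><h1 id="title">Example Product</h1></body></html>'
```

Running this check on `driver.page_source` before selecting elements would tell you whether you're parsing a product page at all.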
Upvotes: 1
Reputation: 66
You can use just BeautifulSoup for that; it's not really hard, and if you are interested, I think there are APIs for that.
Selenium is used mostly to click buttons, and this can slow down your program, because for each button click you have to wait for the page to load. For what you need to do, you need speed, because it is a lot of links :D.
There is good documentation on BeautifulSoup: http://www.pythonforbeginners.com/beautifulsoup/beautifulsoup-4-python
A good API for Python: aws.amazon.com/python
Upvotes: 0
Reputation: 96
If your purpose is just scraping the website, go with BeautifulSoup alone. This will save you some execution time and extra lines of code compared to using Selenium.
BeautifulSoup has a method named findNext that searches forward from the current element, so:
Try something like this:
import bs4
import requests

res = requests.get(url)  # url is the page you want to scrape
soup = bs4.BeautifulSoup(res.text, "lxml")  # lxml parser
text = soup.findNext('div', {'class': 'class_value'}).findNext('div', {'id': 'id_value'}).findAll('a')
This is roughly equivalent to the XPath:
//div[@class="class_value"]//div[@id="id_value"]//a
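The same chain can also be written as a single CSS selector with select, which BeautifulSoup does support. A minimal sketch, where the markup and the class_value/id_value names are hypothetical stand-ins for the real page:

```python
from bs4 import BeautifulSoup

# Hypothetical markup; class_value/id_value stand in for the real attributes.
sample = """
<div class="class_value">
  <div id="id_value">
    <a href="/link1">First</a>
    <a href="/link2">Second</a>
  </div>
</div>
"""

soup = BeautifulSoup(sample, "html.parser")
links = soup.select("div.class_value div#id_value a")
texts = [a.get_text() for a in links]
```

select takes standard CSS descendant syntax, so it reads very close to the XPath version without needing lxml's XPath engine.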
Upvotes: 0
Reputation: 269
Try using Selenium; it supports XPath selectors: driver.find_element_by_xpath(Title) # example
Upvotes: 0