Abhijeet Pal

Reputation: 468

Bs4 Selectors: Scrape Amazon using Beautiful Soup

I am trying to scrape a site which has links to Amazon with Python, using these frameworks: Selenium and Beautiful Soup.

My aim is to scrape the following Amazon product details --> title, price, description, first review

But I am having a hard time with Beautiful Soup selectors. I tried many combinations, but I either get a null output or an error; unfortunately I'm not so experienced. The main problem is that Beautiful Soup doesn't have XPath selectors (AFAIK). Should I move to Scrapy for this task, or is Scrapy too overwhelming for this simple scraper?

This is for the first product; I will iterate this later.

from bs4 import BeautifulSoup
from selenium import webdriver
import time

driver = webdriver.Chrome()  # or webdriver.Firefox()
driver.get('https://www.example.com')
first_article = driver.find_element_by_css_selector('div.button')
first_article.click()
time.sleep(2)  # give the product page time to load
html = driver.page_source
soup = BeautifulSoup(html, 'lxml')
# perform the operation

After that I have to select the respective elements, but how do I do that? In XPath it's something like this:

Title = '//h1[@id="title"]//text()'

Price = '//span[contains(@id,"ourprice") or contains(@id,"saleprice")]/text()'

Category = '//a[@class="a-link-normal a-color-tertiary"]//text()'

But I can't yet work out the product details or the path to the first review. Beautiful Soup's find_all selectors won't be helpful here, I think.
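For reference, XPath-style lookups like the ones above can usually be expressed with Beautiful Soup's CSS `select_one()` method. A minimal sketch against stand-in HTML (the ids and classes mimic the XPath examples; the real Amazon markup will differ):

```python
from bs4 import BeautifulSoup

# Stand-in HTML using the ids/classes from the XPath examples above;
# the real page's markup is whatever Amazon serves.
html = """
<h1 id="title">Example Product</h1>
<span id="ourprice_row">$19.99</span>
<a class="a-link-normal a-color-tertiary">Electronics</a>
"""
soup = BeautifulSoup(html, "html.parser")

title = soup.select_one("h1#title").get_text(strip=True)
# [id*=ourprice] matches ids *containing* "ourprice", like the
# contains(@id, "ourprice") predicate in the XPath version
price = soup.select_one("span[id*=ourprice], span[id*=saleprice]").get_text(strip=True)
category = soup.select_one("a.a-link-normal.a-color-tertiary").get_text(strip=True)

print(title, price, category)
```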

Upvotes: 0

Views: 2859

Answers (4)

Dennis Cafiero

Reputation: 70

Amazon has anti-scraping mechanisms in place: if it detects scraping, it serves a captcha to the scraper. So your issue is that it's returning the HTML for the captcha page, and you are not finding anything.

The only reliable way to scrape Amazon will be to use a headless version of Selenium.

Upvotes: 1

You can just use BeautifulSoup for that; it's not really hard, and if you are interested, I think there are APIs for that.

Selenium is used more often to click buttons, and this can slow down your program, because for each button click you will need to wait for the page to load. For what you need to do, you need speed, because it is a lot of links :D.

There is a good documentation about BeautifulSoup: http://www.pythonforbeginners.com/beautifulsoup/beautifulsoup-4-python

Good Api for python: aws.amazon.com/python

Upvotes: 0

Saurav

Reputation: 96

If your purpose is just scraping the website, go with just BeautifulSoup. This will save you some execution time and extra lines of code compared to using Selenium.

BeautifulSoup has a function named findNext that searches the elements following the current one, so try something like this:

    import bs4
    import requests

    url = 'https://www.example.com'  # the product page URL
    res = requests.get(url)
    soup = bs4.BeautifulSoup(res.text, "lxml")    # lxml parser
    text = soup.findNext('div', {'class': 'class_value'}).findNext('div', {'id': 'id_value'}).findAll('a')

This is similar to the XPath

//div[@class="class_value"]//div[@id="id_value"]//a

Upvotes: 0

Goran

Reputation: 269

Try Selenium; it supports XPath selectors:

driver.find_element_by_xpath(Title)  # example
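The XPath strings from the question can also be sanity-checked without starting a browser, using lxml (which Beautiful Soup's 'lxml' parser already depends on). A sketch with stand-in HTML; the real text comes from driver.page_source:

```python
from lxml import html as lxml_html

# Stand-in page; in practice this string would be driver.page_source
doc = lxml_html.fromstring('<h1 id="title"> Example Product </h1>')

# Same expression as the Title XPath in the question;
# xpath() returns a list of matching text nodes
title = doc.xpath('//h1[@id="title"]//text()')
print(title)
```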

Upvotes: 0
