Qubix

Reputation: 4353

Crawl an entire category of products from Amazon using BeautifulSoup

I have written a function for Amazon that, given a URL, provides me with the title of a product, the price and the rating. This works nicely if I give it one URL in string format. I want to use this function, say it's called AmazonCrawler, in order to scrape one entire product category from the website, not just a single product. How can I do this?

EDIT:

Here is an example page that I would like to scrape: Amazon TV Category. If I look at the page source, I find:

<script type='text/javascript'>var ue_t0=ue_t0||+new Date();</script>
<!-- sp:feature:cs-optimization -->
<meta http-equiv='x-dns-prefetch-control' content='on'>
<link rel="dns-prefetch" href="https://images-eu.ssl-images-amazon.com">
<link rel="dns-prefetch" href="https://m.media-amazon.com">
<link rel="dns-prefetch" href="https://completion.amazon.com">
<script type='text/javascript'>
window.ue_ihb = (window.ue_ihb || window.ueinit || 0) + 1;
if (window.ue_ihb === 1) {

I am interested in a way of finding the URLs of all smart TVs on the Amazon website. Is there an automated way of doing this?

Upvotes: 0

Views: 1210

Answers (3)

Ravi Ranjan

Reputation: 287

Run the following piece of code in the browser console on the amazon.in website and you will get all the categories available on Amazon.

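// The search dropdown holds one <option> per category.
const d = document.getElementById('searchDropdownBox');
Array.from(d.options).forEach(element => {
  // Each value looks like "search-alias=<category>"; keep only the category part.
  console.log(element.value.replace("search-alias=", ""));
});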
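
For a Python-only workflow, the same idea can be sketched with BeautifulSoup. This is a minimal sketch, not a tested scraper for the live site: SAMPLE_HTML is a made-up, trimmed-down stand-in for the dropdown markup, and category_aliases is a hypothetical helper name. It assumes the real page still exposes a select element with id searchDropdownBox, as the console snippet above does.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the search dropdown markup (not copied from Amazon).
SAMPLE_HTML = """
<select id="searchDropdownBox">
  <option value="search-alias=aps">All Categories</option>
  <option value="search-alias=electronics">Electronics</option>
  <option value="search-alias=computers">Computers</option>
</select>
"""


def category_aliases(html):
    """Return the category part of every option value in the search dropdown."""
    soup = BeautifulSoup(html, "html.parser")
    options = soup.select("#searchDropdownBox option")
    # Each value looks like "search-alias=<category>"; strip the prefix.
    return [opt.get("value", "").replace("search-alias=", "") for opt in options]


print(category_aliases(SAMPLE_HTML))
```

With a real page, the html argument would be the text of a response fetched with requests, as in the other answers.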

Upvotes: 0

QHarr

Reputation: 84465

You want a selector that targets all img elements whose src ends with .jpg, while excluding a couple of earlier matches. Combining :not with the preceding .a-row does this. Finally, use set to reduce the results to unique items.

import requests
from bs4 import BeautifulSoup as bs
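from pprint import pprint

# Send a browser-like User-Agent, since the default requests one may be rejected.
r = requests.get(
    'https://www.amazon.es/b/ref=sv_ap_arrow_ce_4_1_1_1?node=934359031',
    headers={'User-Agent': 'Mozilla/5.0'},
)
soup = bs(r.content, 'lxml')
# Collect .jpg sources under .a-row, skipping the first .bxc-grid__row; set() drops duplicates.
images = set(i['src'] for i in soup.select('.a-row img[src$=jpg]:not(.bxc-grid__row:nth-child(1) img[src$=jpg])'))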
pprint(images)
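
To see what the selector does without hitting the network, here is a small offline sketch. SAMPLE_HTML is invented markup that only mimics the class names used above, and unique_jpg_sources is a hypothetical helper name; the real page structure may differ.

```python
from bs4 import BeautifulSoup

# Invented markup: the first row is a banner to exclude, the others hold products.
SAMPLE_HTML = """
<div class="page">
  <div class="bxc-grid__row a-row"><img src="banner.jpg"></div>
  <div class="a-row"><img src="tv1.jpg"><img src="tv1.jpg"><img src="icon.png"></div>
  <div class="a-row"><img src="tv2.jpg"></div>
</div>
"""

SELECTOR = ".a-row img[src$=jpg]:not(.bxc-grid__row:nth-child(1) img[src$=jpg])"


def unique_jpg_sources(html):
    """Return the sorted, de-duplicated .jpg sources matched by SELECTOR."""
    soup = BeautifulSoup(html, "html.parser")
    # set() removes the repeated tv1.jpg; sorted() makes the order predictable.
    return sorted(set(img["src"] for img in soup.select(SELECTOR)))


print(unique_jpg_sources(SAMPLE_HTML))
```

The banner image is dropped by the :not(...) part, icon.png by the [src$=jpg] filter, and the repeated tv1.jpg by the set.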

Upvotes: 0

Giovanni Frison

Reputation: 688

If you use the browser's inspector (for example Chrome DevTools), you'll find that each product image is wrapped in a link whose href points to the URL you want. For example, the first Samsung TV listed has its link at the following XPath:

/html/body/div[1]/div[2]/div[2]/div[1]/div[3]/div[2]/div[2]/ul/li[1]/span/div/a


From here you need to find a way to generalize the search, so that it matches every li in that list rather than only li[1].
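
One possible way to generalize it, sketched with BeautifulSoup: drop the positional index and match the tail of the path (ul/li/span/div/a) for every list item. Everything specific here is an assumption: SAMPLE_HTML is invented markup that mirrors only the end of the XPath above, BASE_URL is a placeholder domain, the /dp/EXAMPLE paths are made up, and product_urls is a hypothetical helper name.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Placeholder base URL, used only to turn relative hrefs into absolute ones.
BASE_URL = "https://www.amazon.es"

# Invented markup mirroring the tail of the XPath: ul/li/span/div/a.
SAMPLE_HTML = """
<ul>
  <li><span><div><a href="/dp/EXAMPLE1"><img src="tv1.jpg"></a></div></span></li>
  <li><span><div><a href="/dp/EXAMPLE2"><img src="tv2.jpg"></a></div></span></li>
  <li><span><div><a href="/dp/EXAMPLE1"><img src="tv1.jpg"></a></div></span></li>
</ul>
"""


def product_urls(html, base_url=BASE_URL):
    """Return absolute product URLs from every list item, without duplicates."""
    soup = BeautifulSoup(html, "html.parser")
    links = soup.select("ul > li > span > div > a[href]")
    absolute = [urljoin(base_url, a["href"]) for a in links]
    # dict.fromkeys keeps the first occurrence of each URL, in page order.
    return list(dict.fromkeys(absolute))


for url in product_urls(SAMPLE_HTML):
    print(url)
```

Each URL returned this way could then be passed to the single-product function from the question (AmazonCrawler).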

Upvotes: 2
