Reputation: 4353
I have written a function for Amazon that, given a URL, returns a product's title, price and rating. It works well when I pass it a single URL as a string. I want to use this function, say it's called AmazonCrawler, to scrape an entire product category from the website, not just a single product. How can I do this?
EDIT:
Here is an example page that I would like to scrape: Amazon TV Category. If I look at the page source, I find:
<script type='text/javascript'>var ue_t0=ue_t0||+new Date();</script>
<!-- sp:feature:cs-optimization -->
<meta http-equiv='x-dns-prefetch-control' content='on'>
<link rel="dns-prefetch" href="https://images-eu.ssl-images-amazon.com">
<link rel="dns-prefetch" href="https://m.media-amazon.com">
<link rel="dns-prefetch" href="https://completion.amazon.com">
<script type='text/javascript'>
window.ue_ihb = (window.ue_ihb || window.ueinit || 0) + 1;
if (window.ue_ihb === 1) {
I am interested in a way of finding the URLs of all smart TVs on the Amazon website. Is there an automated way of doing this?
Upvotes: 0
Views: 1210
Reputation: 287
Paste the piece of code below into the browser console on the amazon.in website and you will get all the categories available on Amazon.
// Read the search category dropdown and print each alias without its prefix.
let d = document.getElementById('searchDropdownBox');
Array.from(d.options).forEach(element => {
  console.log(element.value.replace("search-alias=", ""));
});
Upvotes: 0
Reputation: 84465
You want a selector that targets all img elements whose src ends with .jpg, but you also need to exclude a couple of earlier matches. Using :not together with a preceding .a-row does this. Finally, use a set to keep only unique items.
import requests
from bs4 import BeautifulSoup as bs
from pprint import pprint

url = 'https://www.amazon.es/b/ref=sv_ap_arrow_ce_4_1_1_1?node=934359031'
# Send a browser-like User-Agent header with the request.
r = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
soup = bs(r.content, 'lxml')
# .jpg images inside .a-row, excluding those in the first .bxc-grid__row; set() removes duplicates.
images = set(i['src'] for i in soup.select('.a-row img[src$=jpg]:not(.bxc-grid__row:nth-child(1) img[src$=jpg])'))
pprint(images)
Upvotes: 0
Reputation: 688
If you use the Google Chrome inspector, you'll find the href on the images pointing to the URLs you want. For example, the first Samsung TV you find has its href at the following XPath:
/html/body/div[1]/div[2]/div[2]/div[1]/div[3]/div[2]/div[2]/ul/li[1]/span/div/a
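From here you need to find a way to generalize the search, for example by dropping the index from li[1] so that the path matches every list item rather than only the first.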
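One way to generalize that XPath can be sketched as follows. This is only an illustration, under several assumptions: the SAMPLE markup is a hypothetical, well-formed stand-in for the real listing (which is far larger and may change), BASE_URL is an assumed base for relative links, and product_urls is a made-up helper name. It uses the standard library's ElementTree, which only parses well-formed markup; for a real page, a lenient parser such as lxml.html with the same kind of relative path would probably be needed.

```python
from urllib.parse import urljoin
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the listing markup (not the real page).
SAMPLE = """
<div>
  <ul>
    <li><span><div><a href="/dp/AAA111">TV one</a></div></span></li>
    <li><span><div><a href="/dp/BBB222">TV two</a></div></span></li>
    <li><span><div><a href="/dp/AAA111">TV one again</a></div></span></li>
  </ul>
</div>
"""

BASE_URL = "https://www.amazon.es"  # assumed base for resolving relative hrefs


def product_urls(markup, base_url=BASE_URL):
    """Return the unique absolute link targets of every list item, in page order."""
    root = ET.fromstring(markup.strip())
    # The absolute path ended in .../ul/li[1]/span/div/a and matched one item.
    # Without the [1] index, the same tail matches the link in every <li>.
    anchors = root.findall(".//ul/li/span/div/a")
    urls = []
    for a in anchors:
        href = a.get("href")
        if not href:
            continue  # skip anchors without a link target
        url = urljoin(base_url, href)
        if url not in urls:  # keep the first occurrence only
            urls.append(url)
    return urls


urls = product_urls(SAMPLE)
for url in urls:
    # In the question's setup, this is where AmazonCrawler(url) would be called.
    print(url)
```

Each collected URL could then be passed to the single-product function from the question, ideally with a pause between requests.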
Upvotes: 2