Manuel

Reputation: 802

Scraping Dropdown prompts

I'm having trouble getting data from a dropdown button, and none of the answers on the site (or at least the ones I found) help me.

The website I'm trying to scrape is Amazon, searching for, say, 'Nike Shoes'.

When I enter a product that falls into 'Nike Shoes', I may get a product like this:

https://www.amazon.com/NIKE-Flex-2017-Running-Shoes/dp/B072LGTJKQ/ref=sr_1_1_sspa?ie=UTF8&qid=1546518735&sr=8-1-spons&keywords=nike+shoes&psc=1

Here the size and the color come with the page, so scraping is simple.

The problem comes when I get this type of product:

https://www.amazon.com/NIKE-Lebron-Soldier-Mid-Top-Basketball/dp/B07KJJ52S4/ref=sr_1_3?ie=UTF8&qid=1546518445&sr=8-3&keywords=nike+shoes

Here I have to select a size, and maybe a color, and the price changes if I select a different size.

My question is: is there a way to, for example, access every "shoe size" so I can at least check the price for that size?

If the page had some sort of list of the sizes in its source code it wouldn't be that hard, but the page changes when I select a size and no "list" of shoe sizes appears in the source (the URL doesn't change either).

Upvotes: 0

Views: 189

Answers (1)

Granitosaurus

Reputation: 21446

Most e-commerce websites handle variants by embedding JSON in the HTML and using JavaScript to display the selected variant. So once you have scraped the HTML, you most likely already have all of the variant data.

In your case you'd have the shoe sizes, their prices, etc. embedded in the HTML body. If you search the page source for a sufficiently unique variant name, you can see some JSON in the body.

Now you need to:

  1. Identify where the JSON part is:

    It is usually somewhere inside <script> tags or in a data-<something> attribute of some tag.

  2. Extract the JSON part:

    If it's embedded directly in JavaScript, you can extract it with a regex:

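    import re
    # text of the first <script> tag; adjust the xpath to target the right one
    script = response.xpath('//script/text()').extract_first()
    # capture everything between the first { and the last }
    # (greedy so nested objects stay intact; DOTALL lets . span newlines)
    data = re.findall(r'(\{.+\})', script, re.DOTALL)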
    
  3. Load the JSON as a dict and walk the tree, e.g.:

    import json
    d = json.loads(data[0])
    d['products'][0]
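
The three steps above can be sketched end to end with only the standard library. This is a minimal illustration, not Amazon's actual markup: the sample HTML, the `variantData` variable name, the `products`/`size`/`price` keys and the `extract_json_after` helper are all made up for the example. Instead of a regex over the whole object, it locates the assignment and lets `json.JSONDecoder.raw_decode` parse exactly one JSON value from that position, which copes with nested braces.

```python
import json
import re

# Hypothetical page source: the variable name `variantData` and the keys
# below are assumptions for illustration, not Amazon's real markup.
html = """
<html><body>
<script>
  var variantData = {"products": [
    {"size": "9", "price": 89.99},
    {"size": "10", "price": 94.5}
  ]};
  initPage(variantData);
</script>
</body></html>
"""


def extract_json_after(marker, text):
    """Return the first JSON value assigned to `marker` in `text`."""
    match = re.search(re.escape(marker) + r"\s*=\s*", text)
    if match is None:
        raise ValueError(f"marker {marker!r} not found")
    # raw_decode parses one JSON value starting at the given index and
    # ignores whatever follows it (here, the rest of the script).
    obj, _end = json.JSONDecoder().raw_decode(text, match.end())
    return obj


data = extract_json_after("variantData", html)
# Map each size to its price
prices = {p["size"]: p["price"] for p in data["products"]}
print(prices)
```

In a Scrapy callback you would pass `response.text` (or the text of the relevant `<script>` tag) instead of the sample string. This only works if the embedded value is strict JSON; a JavaScript object literal with unquoted keys or single quotes would need extra cleaning first.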
    

Upvotes: 2
