codeananda

Reputation: 1310

Unable to scrape drop down menu using BeautifulSoup and Requests

I want to scrape the product pages on Breitling's website for various pieces of information.

Example page: https://www.breitling.com/gb-en/watches/navitimer/b01-chronograph-46/AB0127211C1A1/

I am having trouble scraping the watch's strap material given in the dropdown menu above the "ADD TO BAG" button ('steel 1.4435' in the example's case).

The specific element I want is:

<small class="dd-selected-description dd-desc dd-selected-description-truncated">Steel 1.4435</small>

However, this element is not in the response to my GET request. The closest element to the <small> tag that I can find is a <div> with id='strap-selector-list'.

When I call soup.find(id='strap-selector-list'), that <div> comes back empty.

import requests
from bs4 import BeautifulSoup

url = 'https://www.breitling.com/gb-en/watches/navitimer/b01-chronograph-46/AB0127211C1A1/'

r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')

soup.find(id='strap-selector-list')

returns

<div id="strap-selector-list"></div>

How can I get the content that appears inside this <div> when you open the inspector?

Screenshot of page with inspector open highlighting areas of interest

What I've tried:

  1. Spoofing headers. I copied all the request headers (apart from cookies) from the Network tab in developer tools and used them in the GET request (only the changed lines are shown for brevity):
headers = {'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
'accept-encoding': 'gzip, deflate, br',
'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8',
'cache-control': 'max-age=0',
'dnt': '1',
'referer': 'https://www.breitling.com/gb-en/watches/navitimer/?search%5Bref%5D=&search%5Bsorting%5D=newest',
'sec-fetch-mode': 'navigate, same-origin, cors',
'sec-fetch-site': 'same-origin',
'sec-fetch-user': '?1',
'upgrade-insecure-requests': '1',
'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36',
'X-Requested-With': 'XMLHttpRequest'
}

r = requests.get(url, headers=headers)

  2. Checking XHR requests. There are only 3 when the page loads: one for the checkout basket's status, one that gives info on retailers such as their store locations, and status.php, which gives a 404 error.

    If you click the drop-down menu, no XHR requests are sent.

    If you click on any of the items in the drop-down menu, you are taken to the product page for that item.

  3. Using a different parser, e.g. html.parser, makes no difference.

  4. Adding cookies to the headers and performing a normal GET request also makes no difference.

  5. Creating session = requests.Session() first and calling r = session.get(url), with and without headers=headers, also doesn't work.

Any help is much appreciated!

Upvotes: 2

Views: 690

Answers (1)

balderman

Reputation: 23815

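The data you are looking for is not in the <div>. It sits in a script element with id='app-reference-versions', as JSON. The drop-down is presumably built from that JSON by JavaScript in the browser, which requests does not run, so the <div> stays empty in your response.

All you need to do is load the JSON from the script body and traverse the resulting dict.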

import requests
from bs4 import BeautifulSoup
import json
import pprint

url = 'https://www.breitling.com/gb-en/watches/navitimer/b01-chronograph-46/AB0127211C1A1/'

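r = requests.get(url)
# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(r.text, 'html.parser')

# The script element with this id holds the data as a JSON string
script = soup.find(id='app-reference-versions')
pprint.pprint(json.loads(script.string))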

output

https://pastebin.com/kGhMQt61
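
For the "traverse the dict" part, here is a minimal sketch of a recursive lookup that collects every value stored under a given key, so you don't have to hard-code the path through the nested structure. The helper name find_values and the sample data are hypothetical: the keys "versions", "reference", "strap" and "material" are made up for illustration and are not the real keys from the page, so check the pprint output and adjust.

```python
import json


def find_values(node, wanted_key):
    """Yield every value stored under wanted_key, at any nesting depth."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == wanted_key:
                yield value
            # Keep descending: the same key may appear deeper down too
            yield from find_values(value, wanted_key)
    elif isinstance(node, list):
        for item in node:
            yield from find_values(item, wanted_key)


# Hypothetical stand-in for json.loads(script.string); the real keys differ.
sample = json.loads('''
{"versions": [
    {"reference": "REF-1", "strap": {"material": "Steel 1.4435"}},
    {"reference": "REF-2", "strap": {"material": "Leather"}}
]}
''')

materials = list(find_values(sample, "material"))
print(materials)  # ['Steel 1.4435', 'Leather']
```

With the real page you would pass json.loads(script.string) instead of sample, and whichever key holds the strap description in the printed output.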

Upvotes: 1
