Reputation: 3
I am trying to get the sizes from here.
The content I want:
However, I am receiving:
[<div class="options" id="productSizeStock">
<button class="btn options-loading" disabled="" type="button">
</button>
<button class="btn options-loading" disabled="" type="button">
</button>
<button class="btn options-loading" disabled="" type="button">
</button>
<button class="btn options-loading" disabled="" type="button">
</button>
<button class="btn options-loading" disabled="" type="button">
</button>
<button class="btn options-loading" disabled="" type="button">
</button>
<button class="btn options-loading" disabled="" type="button">
</button>
<button class="btn options-loading" disabled="" type="button">
</button>
<button class="btn options-loading" disabled="" type="button">
</button>
<button class="btn options-loading" disabled="" type="button">
</button>
<button class="btn options-loading" disabled="" type="button">
</button>
<button class="btn options-loading" disabled="" type="button">
</button>
I also tried using requests-html to see if it was a JavaScript rendering issue, but I was just receiving empty values.
Here is my code:
import requests
import time
import randomheaders
from bs4 import BeautifulSoup

proxy = {'''PROXY'''}

while True:
    try:
        source = requests.get("https://www.size.co.uk/product/grey-nike-air-max-98-se/132114/", proxies=proxy, headers=randomheaders.LoadHeader(), timeout=30).text
        soup = BeautifulSoup(source, features="lxml")
        print(soup.find_all("div", class_="options"))
    except Exception as e:
        print(e)
        time.sleep(5)
Upvotes: 0
Views: 6102
Reputation: 4783
From a technical point of view your code is correct. Because this website uses JavaScript to render itself, the sizes are stored at a different URL, which is the following:
https://www.size.co.uk/product/grey-nike-air-max-98-se/132114/stock
As you can see, you just have to add /stock to your initial URL.
That being said, try replacing this:
source = requests.get("https://www.size.co.uk/product/grey-nike-air-max-98-se/132114/", proxies= proxy, headers=randomheaders.LoadHeader(),timeout=30).text
soup = BeautifulSoup(source, features = "lxml")
print(soup.find_all("div", class_="options"))
with:
source = requests.get("https://www.size.co.uk/product/grey-nike-air-max-98-se/132114/stock", proxies= proxy, headers=randomheaders.LoadHeader(),timeout=30).text
soup = BeautifulSoup(source, features = "lxml")
sizes = [x["title"].replace("Select Your UK Size ","") for x in soup.find_all("button",{"data-e2e":"product-size"})]
print(sizes)
Here sizes is a list containing all of the sizes and has the following output:
['6', '7', '7.5', '8', '8.5', '9', '9.5', '10', '10.5', '11', '11.5', '12']
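For completeness, a self-contained sketch that folds this fix into the question's retry loop might look like the one below. The empty proxies dict stands in for the question's PROXY placeholder, and the break is an addition so the loop stops after a successful fetch:
import requests
import time
import randomheaders
from bs4 import BeautifulSoup

# Stand-in for the question's PROXY placeholder, e.g. {"https": "http://host:port"}
proxy = {}

while True:
    try:
        # The /stock endpoint returns the already-rendered size buttons
        source = requests.get("https://www.size.co.uk/product/grey-nike-air-max-98-se/132114/stock", proxies=proxy, headers=randomheaders.LoadHeader(), timeout=30).text
        soup = BeautifulSoup(source, features="lxml")
        sizes = [x["title"].replace("Select Your UK Size ", "") for x in soup.find_all("button", {"data-e2e": "product-size"})]
        print(sizes)
        break  # stop retrying once the sizes have been fetched
    except Exception as e:
        print(e)
        time.sleep(5)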
Hope this helps!
Upvotes: 1
Reputation: 2679
It is probably because the information you are searching for is added dynamically by a client-side script (JavaScript in this case). I don't see an easy way to get the information with requests alone; if that is the case, you should analyse the page's scripts more closely and, if really motivated, perform the proper AJAX requests yourself.
So, to recap, you are not getting the correct results because any JS-generated content has to be rendered into the document. When you fetch the HTML page, you fetch only the initial document.
A possible solution (this solution is for Python 3.6 only) consists of using requests-HTML instead of requests:
This library intends to make parsing HTML (e.g. scraping the web) as simple and intuitive as possible.
Install requests-html: pipenv install requests-html
Make a request to the page's url:
from requests_html import HTMLSession
session = HTMLSession()
r = session.get(a_page_url)
Render the response to get the Javascript generated bits:
r.html.render()
This module offers scraping with JavaScript support, which is exactly what you need.
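Applied to the page from the question, a minimal sketch could look like the following. The CSS selector for the size buttons is an assumption based on the markup used in the other answer, and the first call to render() downloads Chromium:
from requests_html import HTMLSession

session = HTMLSession()
r = session.get("https://www.size.co.uk/product/grey-nike-air-max-98-se/132114/")

# Execute the page's JavaScript so the size buttons get filled in
r.html.render()

# Assumed selector: the size buttons carry a data-e2e="product-size" attribute
buttons = r.html.find('button[data-e2e="product-size"]')
sizes = [b.attrs.get("title", "").replace("Select Your UK Size ", "") for b in buttons]
print(sizes)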
Upvotes: 3