Reputation: 168
I am using Python's BeautifulSoup package to scrape the following page: https://www.nike.com/w/womens-shoes-5e1x6zy7ok
When I run the following code:
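import mechanize  # assumption: br below is a mechanize browser
from bs4 import BeautifulSoup as BS
br = mechanize.Browser()
data = br.open("https://www.nike.com/w/womens-shoes-5e1x6zy7ok").read()
soup = BS(data, 'html.parser')
shoes = soup.find_all('div', {'class':'product-card__body'})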
I receive only:
<picture><source media="0" srcset=""/><source media="1" srcset=""/><source media="2" srcset=""/><img alt="Nike Air Max 2090 Women's Shoe" src=""/></picture>
However, if I copy the same element directly from the page in my browser, I get much more information:
<picture><source srcset="product-card__body" media="(min-width: 1024px)"><source srcset="https://static.nike.com/a/images/c_limit,w_592,f_auto/t_product_v1/b2bfaf14-ed59-48a7-b8ae-e684b1d605ce/air-max-270-react-se-womens-shoe-6bhhrf.jpg" media="(max-width: 1023px) and (-webkit-min-device-pixel-ratio: 2), (min-resolution: 192dpi)"><source srcset="https://static.nike.com/a/images/c_limit,w_318,f_auto/t_product_v1/b2bfaf14-ed59-48a7-b8ae-e684b1d605ce/air-max-270-react-se-womens-shoe-6bhhrf.jpg" media="(max-width: 1023px)"><img src="https://static.nike.com/a/images/c_limit,w_318,f_auto/t_product_v1/b2bfaf14-ed59-48a7-b8ae-e684b1d605ce/air-max-270-react-se-womens-shoe-6bhhrf.jpg" alt="Nike Air Max 270 React SE Women's Shoe"></picture>
How do I use BeautifulSoup to obtain the latter information?
Upvotes: 0
Views: 309
Reputation: 195573
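The data is loaded via JavaScript from their API, which is why the HTML that BeautifulSoup sees contains only empty placeholders. The initial products are embedded in the page as JSON assigned to window.INITIAL_REDUX_STATE. This script extracts that JSON and prints the initial products on the page: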
import re
import json
import requests
from bs4 import BeautifulSoup

url = 'https://www.nike.com/gb/w/womens-shoes-5e1x6zy7ok'
html_data = requests.get(url).text
data = json.loads(re.search(r'window.INITIAL_REDUX_STATE=(\{.*?\});', html_data).group(1))

for p in data['Wall']['products']:
    print(p['title'])
    print(p['subtitle'])
    print(p['price']['currentPrice'], p['price']['currency'])
    print(p['colorways'][0]['images']['portraitURL'].replace('w_400', 'w_1920'))
    print('-' * 120)
Prints:
Nike Air VaporMax 2020 FK
Women's Shoe
189.95 GBP
https://static.nike.com/a/images/c_limit,w_1920,f_auto/t_product_v1/d4452769-d6ac-4121-8f98-96f7cb9e0f68/image.jpg
------------------------------------------------------------------------------------------------------------------------
Nike Air Max 90
Women's Shoe
114.95 GBP
https://static.nike.com/a/images/c_limit,w_1920,f_auto/t_product_v1/e4182f87-d936-4052-a14a-b3c8bd161a38/image.jpg
------------------------------------------------------------------------------------------------------------------------
NikeCourt Air Zoom GP Turbo
Women's Hard Court Tennis Shoe
124.95 GBP
https://static.nike.com/a/images/c_limit,w_1920,f_auto/t_product_v1/4ec4011a-1c46-42f4-9b4b-ff99fd9592f2/image.jpg
------------------------------------------------------------------------------------------------------------------------
Nike Air Zoom SuperRep Premium
Women's HIIT Class Shoe
114.95 GBP
https://static.nike.com/a/images/c_limit,w_1920,f_auto/t_product_v1/d058f141-eebb-4578-bc87-53867c9ee173/image.jpg
------------------------------------------------------------------------------------------------------------------------
...and so on.
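One caveat about the regular expression above: it stops at the first `};` it finds, so it would cut the JSON short if that sequence ever appeared inside a string value. A minimal sketch of an alternative, assuming the page still embeds the state as `window.INITIAL_REDUX_STATE=` followed by a JSON object, is to let `json.JSONDecoder.raw_decode` find the end of the object. The helper name `extract_state` and the sample HTML are hypothetical and exist only for illustration.

```python
import json

MARKER = 'window.INITIAL_REDUX_STATE='

def extract_state(html_data, marker=MARKER):
    """Return the JSON value that follows `marker` in html_data, or None."""
    start = html_data.find(marker)
    if start == -1:
        return None  # marker not present; the page layout may have changed
    idx = start + len(marker)
    # raw_decode does not skip leading whitespace, so do it by hand.
    while idx < len(html_data) and html_data[idx].isspace():
        idx += 1
    # raw_decode parses one JSON value starting at idx and ignores
    # whatever follows it (here: the rest of the <script> tag).
    state, _end = json.JSONDecoder().raw_decode(html_data, idx)
    return state

# Hypothetical sample standing in for the real page source.
sample = ('<script>window.INITIAL_REDUX_STATE='
          '{"Wall": {"products": [{"title": "a};b"}]}};</script>')
state = extract_state(sample)
print(state['Wall']['products'][0]['title'])
```

With the real page, `extract_state(requests.get(url).text)` would take the place of the `json.loads(re.search(...))` line, provided the marker is still present in the page source.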
EDIT: To print products from all pages:
import re
import json
import requests
from bs4 import BeautifulSoup

url = 'https://www.nike.com/gb/w/womens-shoes-5e1x6zy7ok'
html_data = requests.get(url).text
data = json.loads(re.search(r'window.INITIAL_REDUX_STATE=(\{.*?\});', html_data).group(1))

for p in data['Wall']['products']:
    print(p['title'])
    print(p['subtitle'])
    print(p['price']['currentPrice'], p['price']['currency'])
    print(p['colorways'][0]['images']['portraitURL'].replace('w_400', 'w_1920'))
    print('-' * 120)

next_page = data['Wall']['pageData']['next']
while next_page:
    u = 'https://www.nike.com' + next_page
    data = requests.get(u).json()
    for o in data['objects']:
        p = o['productInfo'][0]
        print(p['productContent']['title'])
        print(p['productContent']['subtitle'])
        print(p['merchPrice']['currentPrice'], p['merchPrice']['currency'])
        print(p['imageUrls']['productImageUrl'])
        print('-' * 120)
    next_page = data.get('pages', {'next':''})['next']
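The pagination loop above can also be wrapped in a generator, so the same logic can feed a list, a CSV writer, or a test without any network access. This is only a sketch: `iter_products` and its `fetch_json` parameter are hypothetical names, and the key names (`Wall`, `pageData`, `objects`, `productInfo`, `merchPrice`, `pages`) are taken from the script above and assumed to still match the site's responses.

```python
def iter_products(first_state, fetch_json, base='https://www.nike.com'):
    """Yield (title, subtitle, price, currency) tuples across all pages.

    first_state: parsed INITIAL_REDUX_STATE dict of the first page.
    fetch_json:  callable taking a URL and returning the decoded JSON dict,
                 e.g. lambda u: requests.get(u).json() (hypothetical hook).
    """
    # The first page uses the 'Wall' layout embedded in the HTML.
    for p in first_state['Wall']['products']:
        yield (p['title'], p['subtitle'],
               p['price']['currentPrice'], p['price']['currency'])
    next_page = first_state['Wall']['pageData']['next']
    # Later pages come from the API and use a different layout.
    while next_page:
        data = fetch_json(base + next_page)
        for o in data['objects']:
            p = o['productInfo'][0]
            yield (p['productContent']['title'],
                   p['productContent']['subtitle'],
                   p['merchPrice']['currentPrice'],
                   p['merchPrice']['currency'])
        # An empty or missing 'next' ends the loop.
        next_page = data.get('pages', {}).get('next', '')

# Made-up data standing in for the real responses.
fake_pages = {
    'https://www.nike.com/page2': {
        'objects': [{'productInfo': [{
            'productContent': {'title': 'Shoe B', 'subtitle': "Women's Shoe"},
            'merchPrice': {'currentPrice': 20.0, 'currency': 'GBP'}}]}],
        'pages': {'next': ''},
    },
}
first = {'Wall': {
    'products': [{'title': 'Shoe A', 'subtitle': "Women's Shoe",
                  'price': {'currentPrice': 10.0, 'currency': 'GBP'}}],
    'pageData': {'next': '/page2'}}}

rows = list(iter_products(first, fake_pages.__getitem__))
for row in rows:
    print(row)
```

Passing the fetch function in as an argument keeps the paging logic separate from the HTTP call, which makes it easy to add a delay or a retry between requests later.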
Upvotes: 1
Reputation: 875
Try sending a browser-like User-Agent header with the request:
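import requests
from bs4 import BeautifulSoup
...
url = '<your URL>'
user_agent = '<user-agent from your browser>'
req = requests.get(url, headers={'User-Agent': user_agent})
if not req.ok:
    # Error: stop here instead of parsing an error page
    raise RuntimeError('Request failed with status %d' % req.status_code)
soup = BeautifulSoup(req.text, 'html.parser')
...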
Upvotes: 0