Reputation: 164
I have this code:
import json

import requests
from bs4 import BeautifulSoup

product_url = 'https://www.burton.com/us/en/p/burton-elite-long-sleeve-tshirt/W21-203921.html?cgid=womens-tees'
res = requests.get(product_url, headers=headers)  # headers is my request headers dict (not shown)
soup = BeautifulSoup(res.text, 'html.parser')
product = soup.find('main', {'id': 'main-content'})
details = product.find('script')
data = json.loads(details.string)
Printing details gives this output:
<script>
__metadata.product = {
id: "W21-203921",
sku: "W21-203921",
ph1: 'SOFTGOODS',
ph2: 'BASIC FLEECE AND TEE',
ph3: 'LS TEES',
ph4: '',
upc: '190450612509',
ean: '9009521451408',
brand: "Burton",
category: "womens-tees",
primaryCategory: "womens-sale-sweaters-shirts",
currency: "USD",
gender: "Unisex",
label: "Burton Elite Long Sleeve T-Shirt",
name: "Burton Elite Long Sleeve T-Shirt"
};
__metadata.criteo = {
pageType: 'ProductPage'
};
</script>
Now I want to extract some of this data like id, brand, category, and name.
I have looked at pretty much every thread on this forum with very similar questions and tried their solutions, and nothing ever works. Most of them do something along the lines of data = json.loads(details) in various ways, and none of them work. The most common errors I get are:
json.decoder.JSONDecodeError: Expecting value: line 2 column 9 (char 9)
or
TypeError: the JSON object must be str, bytes or bytearray, not Tag
Upvotes: 3
Views: 328
Reputation: 28565
It'll be far easier and more robust to just get the data from the ajax format. Pass format=ajax through the params parameter of requests.get, then pull out whatever you want from the resulting JSON/dictionary. This works for the other URL you provided in the comments too.
import requests
url = 'https://www.burton.com/us/en/p/burton-elite-long-sleeve-tshirt/W21-203921.html?cgid=womens-tees'
payload = {'format':'ajax'}
jsonData = requests.get(url, params=payload).json()
Output:
print(jsonData['data']['products'][0])
{'id': 'W21-203921', 'hideOutOfStockVariants': True, 'brand': 'Burton', 'name': 'Burton Elite Long Sleeve T-Shirt', 'subtitle': '100% Organic Cotton Long Sleeve Graphic T Shirt', 'shortDescription': "A comfortable long sleeve T-shirt that's an unsung favorite for social hour and Sunday in the park.", 'gender': 'Unisex', 'season': 'W21', 'isBoard': False, 'hasSizeChart': True, 'hasSizeFinder': False, 'selectedVariations': {'variationColor': '', 'variationSize': ''}, 'links': {'master': 'https://www.burton.com/us/en/p/burton-elite-long-sleeve-tshirt/W21-203921.html', 'variations': '/on/demandware.store/Sites-Burton_NA-Site/en_US/Product-GetVariationJSON?pid=W21-203921', 'manual': '/us/en/help/manuals.html', 'yotpoAPI': 'https://api.yotpo.com/v1/widget/AbBl1exDWS4rzXsg73rzUKlzUOo10aeMXRkIGHVG/products/W21-203921/reviews?per_page=0', 'tech': '/on/demandware.store/Sites-Burton_NA-Site/en_US/Product-GetTechFeaturesJSON?pids=W21-203921', 'recommendations': '/on/demandware.store/Sites-Burton_NA-Site/en_US/Product-GetRecommendationsJSON?pids=W21-203921', 'ultimateSetup': '/on/demandware.store/Sites-Burton_NA-Site/en_US/Product-GetRecommendationsJSON?pids=W21-203921', 'dynamicslots': '/on/demandware.store/Sites-Burton_NA-Site/en_US/Slot-GetDynamicSlots?pid=W21-203921'}, 'variationValueCount': {'variationColor': 4, 'variationSize': 7}, 'finePrint': [], 'images': {'type': 'PRODUCT_LEVEL', 'views': [{'id': '_4U', 'active': True, 'type': 'image', 'masterLevel': False, 'template': None, 'format': None, 'url': {'base': 'https://www.burton.com/static/product/W21/20392102001_4U.png'}}, {'id': '_3W', 'active': True, 'type': 'image', 'masterLevel': False, 'template': None, 'format': None, 'url': {'base': 'https://www.burton.com/static/product/W21/20392102001_3W.png'}}, {'id': '_4M', 'active': True, 'type': 'image', 'masterLevel': False, 'template': None, 'format': None, 'url': {'base': 'https://www.burton.com/static/product/W21/20392102001_4M.png'}}, {'id': '_5M', 'active': True, 
'type': 'image', 'masterLevel': False, 'template': None, 'format': None, 'url': {'base': 'https://www.burton.com/static/product/W21/20392102001_5M.png'}}, {'id': '_6W', 'active': True, 'type': 'image', 'masterLevel': False, 'template': None, 'format': None, 'url': {'base': 'https://www.burton.com/static/product/W21/20392102001_6W.png'}}, {'id': '_1', 'active': True, 'type': 'image', 'masterLevel': False, 'template': None, 'format': None, 'url': {'base': 'https://www.burton.com/static/product/W21/20392102001_1.png'}}], 'variationImageData': [{'variationColorID': '20392102001', 'display': {'category': {'primary': '_4U', 'focus': '_1'}}, 'views': [{'id': '_4U', 'active': True, 'type': 'image', 'masterLevel': False, 'template': None, 'format': None, 'url': {'base': 'https://www.burton.com/static/product/W21/20392102001_4U.png'}}, {'id': '_3W', 'active': True, 'type': 'image', 'masterLevel': False, 'template': None, 'format': None, 'url': {'base': 'https://www.burton.com/static/product/W21/20392102001_3W.png'}}, {'id': '_4M', 'active': True, 'type': 'image', 'masterLevel': False, 'template': None, 'format': None, 'url': {'base': 'https://www.burton.com/static/product/W21/20392102001_4M.png'}}, {'id': '_5M', 'active': True, 'type': 'image', 'masterLevel': False, 'template': None, 'format': None, 'url': {'base': 'https://www.burton.com/static/product/W21/20392102001_5M.png'}}, {'id': '_6W', 'active': True, 'type': 'image', 'masterLevel': False, 'template': None, 'format': None, 'url': {'base': 'https://www.burton.com/static/product/W21/20392102001_6W.png'}}, {'id': '_1', 'active': True, 'type': 'image', 'masterLevel': False, 'template': None, 'format': None, 'url': {'base': 'https://www.burton.com/static/product/W21/20392102001_1.png'}}]}, {'variationColorID': '20392102300', 'display': {'category': {'primary': '_4U', 'focus': '_1'}}, 'views': [{'id': '_4U', 'active': True, 'type': 'image', 'masterLevel': False, 'template': None, 'format': None, 'url': {'base': 
'https://www.burton.com/static/product/W21/20392102300_4U.png'}}, {'id': '_3M', 'active': True, 'type': 'image', 'masterLevel': False, 'template': None, 'format': None, 'url': {'base': 'https://www.burton.com/static/product/W21/20392102300_3M.png'}}, {'id': '_4W', 'active': True, 'type': 'image', 'masterLevel': False, 'template': None, 'format': None, 'url': {'base': 'https://www.burton.com/static/product/W21/20392102300_4W.png'}}, {'id': '_5M', 'active': True, 'type': 'image', 'masterLevel': False, 'template': None, 'format': None, 'url': {'base': 'https://www.burton.com/static/product/W21/20392102300_5M.png'}}, {'id': '_6W', 'active': True, 'type': 'image', 'masterLevel': False, 'template': None, 'format': None, 'url': {'base': 'https://www.burton.com/static/product/W21/20392102300_6W.png'}}, {'id': '_1', 'active': True, 'type': 'image', 'masterLevel': False, 'template': None, 'format': None, 'url': {'base': 'https://www.burton.com/static/product/W21/20392102300_1.png'}}]}, {'variationColorID': '20392103200', 'display': {'category': {'primary': '_4'}}, 'views': [{'id': '_4', 'active': True, 'type': 'image', 'masterLevel': False, 'template': None, 'format': None, 'url': {'base': 'https://www.burton.com/static/product/W21/20392103200_4.png'}}, {'id': '_3', 'active': True, 'type': 'image', 'masterLevel': False, 'template': None, 'format': None, 'url': {'base': 'https://www.burton.com/static/product/W21/20392103200_3.png'}}, {'id': '_5', 'active': True, 'type': 'image', 'masterLevel': False, 'template': None, 'format': None, 'url': {'base': 'https://www.burton.com/static/product/W21/20392103200_5.png'}}, {'id': '_6', 'active': True, 'type': 'image', 'masterLevel': False, 'template': None, 'format': None, 'url': {'base': 'https://www.burton.com/static/product/W21/20392103200_6.png'}}]}, {'variationColorID': '20392103400', 'display': {'category': {'primary': '_3'}}, 'views': [{'id': '_3', 'active': True, 'type': 'image', 'masterLevel': False, 'template': None, 
'format': None, 'url': {'base': 'https://www.burton.com/static/product/W21/20392103400_3.png'}}, {'id': '_4', 'active': True, 'type': 'image', 'masterLevel': False, 'template': None, 'format': None, 'url': {'base': 'https://www.burton.com/static/product/W21/20392103400_4.png'}}, {'id': '_5', 'active': True, 'type': 'image', 'masterLevel': False, 'template': None, 'format': None, 'url': {'base': 'https://www.burton.com/static/product/W21/20392103400_5.png'}}, {'id': '_6', 'active': True, 'type': 'image', 'masterLevel': False, 'template': None, 'format': None, 'url': {'base': 'https://www.burton.com/static/product/W21/20392103400_6.png'}}]}]}, 'ean': '9009521451408', 'upc': '190450612509', 'badges': '', 'category': 'womens-tees', 'primaryCategory': 'womens-sale-sweaters-shirts', 'hasUltimateSetup': False, 'ph1': 'SOFTGOODS', 'ph2': 'BASIC FLEECE AND TEE', 'ph3': 'LS TEES', 'ph4': '', 'videoID': '', 'videoPoster': '', 'videoVertical': '', 'spectrumObjects': False, 'scrollingText': False, 'cartSpecialCalloutMessage': False, 'disableEcommerce': False}
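As a minimal sketch of pulling out just the fields the question asks about (id, brand, category, name): the dict below is a trimmed stand-in for jsonData['data']['products'][0] using values from the output above, and pick_fields is a hypothetical helper name, not part of requests.

```python
# Trimmed stand-in for jsonData['data']['products'][0] (values copied from the output above).
product = {
    'id': 'W21-203921',
    'brand': 'Burton',
    'name': 'Burton Elite Long Sleeve T-Shirt',
    'gender': 'Unisex',
    'category': 'womens-tees',
}


def pick_fields(record, fields):
    """Return only the requested keys; keys missing from record map to None."""
    return {field: record.get(field) for field in fields}


wanted = pick_fields(product, ['id', 'brand', 'category', 'name'])
print(wanted)
```

Using .get instead of indexing is a design choice here: if the site drops a key for some product, the result contains None rather than raising KeyError.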
Update:
To get price and stock, you need to pull out the product ID from that first response, then make a new request:
import requests
import pandas as pd

urls = [
    'https://www.burton.com/us/en/p/burton-elite-long-sleeve-tshirt/W21-203921.html?cgid=womens-tees',
    'https://www.burton.com/us/en/p/girls-burton-chicklet-flat-top-snowboard/W21-107341.html',
]
payload = {'format': 'ajax'}

# First pass: collect the master product ID from each product page.
productID_list = []
for url in urls:
    jsonData = requests.get(url, params=payload).json()
    productID = jsonData['data']['masterID']
    productID_list.append(productID)

# Second pass: request price and stock data for every variation of each product.
stock = []
for productID in productID_list:
    prod_url = 'https://www.burton.com/on/demandware.store/Sites-Burton_NA-Site/en_US/Product-GetVariationJSON'
    payload = {'pid': productID, 'pricing': ''}
    productData = requests.get(prod_url, params=payload).json()
    for each in productData['data']['variations']['variationValues']:
        row = {}
        row['name'] = each['name']
        row['color'] = each['variationColor']['displayName']
        row['size'] = each['variationSize']['displayName']
        row['standard_price'] = each['price']['standardPriceUnformatted']
        row['sale_price'] = each['price']['salePriceUnformatted']
        row['isOnSale'] = each['price']['isOnSale']
        row['available'] = each['status']['available']
        row['inStock'] = each['status']['meta']['type']
        stock.append(row)

df = pd.DataFrame(stock)
Output:
print(df.to_string())
                                         name          color  size  standard_price  sale_price  isOnSale  available        inStock
 0           Burton Elite Long Sleeve T-Shirt     True Black     L           39.95                 False       True       IN_STOCK
 1           Burton Elite Long Sleeve T-Shirt     True Black     M           39.95                 False       True       IN_STOCK
 2           Burton Elite Long Sleeve T-Shirt     True Black     S           39.95                 False       True       IN_STOCK
 3           Burton Elite Long Sleeve T-Shirt     True Black    XL           39.95                 False       True       IN_STOCK
 4           Burton Elite Long Sleeve T-Shirt     True Black    XS           39.95                 False       True       IN_STOCK
 5           Burton Elite Long Sleeve T-Shirt     True Black   XXL           39.95                 False       True       IN_STOCK
 6           Burton Elite Long Sleeve T-Shirt     True Black   XXS           39.95                 False       True       IN_STOCK
 7           Burton Elite Long Sleeve T-Shirt  Martini Olive     L           39.95                 False       True       IN_STOCK
 8           Burton Elite Long Sleeve T-Shirt  Martini Olive     M           39.95                 False       True       IN_STOCK
 9           Burton Elite Long Sleeve T-Shirt  Martini Olive     S           39.95                 False       True       IN_STOCK
10           Burton Elite Long Sleeve T-Shirt  Martini Olive    XL           39.95                 False       True       IN_STOCK
11           Burton Elite Long Sleeve T-Shirt  Martini Olive    XS           39.95                 False       True       IN_STOCK
12           Burton Elite Long Sleeve T-Shirt  Martini Olive   XXL           39.95                 False       True       IN_STOCK
13           Burton Elite Long Sleeve T-Shirt  Martini Olive   XXS           39.95                 False       True       IN_STOCK
14           Burton Elite Long Sleeve T-Shirt     True Penny     L           39.95       27.96      True       True       IN_STOCK
15           Burton Elite Long Sleeve T-Shirt     True Penny     M           39.95       27.96      True       True       IN_STOCK
16           Burton Elite Long Sleeve T-Shirt     True Penny     S           39.95       27.96      True      False      BACKORDER
17           Burton Elite Long Sleeve T-Shirt     True Penny    XL           39.95       27.96      True       True       IN_STOCK
18           Burton Elite Long Sleeve T-Shirt     True Penny    XS           39.95       27.96      True      False  NOT_AVAILABLE
19           Burton Elite Long Sleeve T-Shirt     True Penny   XXL           39.95       27.96      True       True       IN_STOCK
20           Burton Elite Long Sleeve T-Shirt     True Penny   XXS           39.95       27.96      True       True       IN_STOCK
21           Burton Elite Long Sleeve T-Shirt     Lapis Blue     L           39.95       27.96      True      False  NOT_AVAILABLE
22           Burton Elite Long Sleeve T-Shirt     Lapis Blue     M           39.95       27.96      True      False      BACKORDER
23           Burton Elite Long Sleeve T-Shirt     Lapis Blue     S           39.95       27.96      True      False      BACKORDER
24           Burton Elite Long Sleeve T-Shirt     Lapis Blue    XL           39.95       27.96      True      False  NOT_AVAILABLE
25           Burton Elite Long Sleeve T-Shirt     Lapis Blue    XS           39.95       27.96      True       True       IN_STOCK
26           Burton Elite Long Sleeve T-Shirt     Lapis Blue   XXL           39.95       27.96      True      False      BACKORDER
27           Burton Elite Long Sleeve T-Shirt     Lapis Blue   XXS           39.95       27.96      True       True       IN_STOCK
28  Girls' Burton Chicklet Flat Top Snowboard             80    80          199.95                 False      False      BACKORDER
29  Girls' Burton Chicklet Flat Top Snowboard             90    90          199.95                 False      False      BACKORDER
30  Girls' Burton Chicklet Flat Top Snowboard            100   100          199.95                 False      False      BACKORDER
31  Girls' Burton Chicklet Flat Top Snowboard            110   110          199.95                 False      False      BACKORDER
32  Girls' Burton Chicklet Flat Top Snowboard            115   115          199.95                 False      False      BACKORDER
33  Girls' Burton Chicklet Flat Top Snowboard            120   120          199.95                 False       True       IN_STOCK
34  Girls' Burton Chicklet Flat Top Snowboard            125   125          199.95                 False       True       IN_STOCK
35  Girls' Burton Chicklet Flat Top Snowboard            130   130          199.95                 False       True       IN_STOCK
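Once the rows are collected, narrowing them down is a plain filter. The sketch below works on a few hand-copied rows from the table above rather than a live response; the helper name on_sale_and_available is hypothetical, and the value types (floats for prices, an empty string for a missing sale price) are assumptions about what the endpoint returns.

```python
# A few rows copied by hand from the table above (types are assumed, not confirmed).
stock = [
    {'color': 'True Black', 'size': 'L', 'standard_price': 39.95, 'sale_price': '',
     'isOnSale': False, 'available': True, 'inStock': 'IN_STOCK'},
    {'color': 'True Penny', 'size': 'L', 'standard_price': 39.95, 'sale_price': 27.96,
     'isOnSale': True, 'available': True, 'inStock': 'IN_STOCK'},
    {'color': 'True Penny', 'size': 'S', 'standard_price': 39.95, 'sale_price': 27.96,
     'isOnSale': True, 'available': False, 'inStock': 'BACKORDER'},
    {'color': 'Lapis Blue', 'size': 'XS', 'standard_price': 39.95, 'sale_price': 27.96,
     'isOnSale': True, 'available': True, 'inStock': 'IN_STOCK'},
]


def on_sale_and_available(rows):
    """Keep only the variations that are discounted and can be ordered now."""
    return [row for row in rows if row['isOnSale'] and row['available']]


deals = on_sale_and_available(stock)
for row in deals:
    print(row['color'], row['size'], row['sale_price'])
```

If both columns come through as booleans, the same filter on the DataFrame would be a boolean mask such as df[df['isOnSale'] & df['available']].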
Upvotes: 2
Reputation: 56895
I'm leaving the answer below for posterity, but the format=ajax approach in the other answer is better. Moral of the story: check the XHR requests and see if you can circumvent the string parsing entirely by working with their API.
As I wrote in a comment, there are so many different assumptions you could make about this data and equally many strategies you could use to extract it.
Which you use depends on many factors: is this data format likely to change? Is it a one-off scrape or something you need to be resilient to as many future modifications as possible? If the latter, which future modifications seem most likely based on your knowledge of the site?
Given that these questions weren't addressed, I assume you just want to parse it into a dict as simply as possible without making all sorts of futureproofing assumptions.
You can use:
import json
import re
chunk = re.search(r"\{[^}]+", html).group().replace("'", '"')
data = json.loads(re.sub(r"(\w+):", r'"\1":', chunk) + "}")
This assumes the JS object contains no nested braces, that no string value contains an apostrophe or a word followed by a colon, and so on.
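To see the two substitutions at work, here is a small self-contained run. The html string is a shortened copy of the script block from the question, standing in for the page source that the snippet above expects in html.

```python
import json
import re

# Shortened copy of the <script> block from the question.
html = """<script>
__metadata.product = {
id: "W21-203921",
ph1: 'SOFTGOODS',
ph4: '',
brand: "Burton",
category: "womens-tees",
name: "Burton Elite Long Sleeve T-Shirt"
};
</script>"""

# Grab everything from the first "{" up to (not including) the first "}",
# then turn single-quoted strings into double-quoted ones.
chunk = re.search(r"\{[^}]+", html).group().replace("'", '"')

# Wrap each bare key in double quotes and restore the closing brace.
data = json.loads(re.sub(r"(\w+):", r'"\1":', chunk) + "}")

print(data["id"], data["brand"], data["category"], data["name"])
```

The result is an ordinary dict, so the fields from the question are available as data["id"], data["brand"], and so on.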
As an example of the above warning, OP has replied that the key label: "Girls' Burton Chicklet Flat Top Snowboard", breaks the regex because it has a ' in it that is replaced with an unescaped ".
This can be fixed for this case by assuming that the ' is not followed by a " on the same line:
chunk = re.sub(r'\'(?![^\n"]*")', '"', re.search(r"\{[^}]+", html).group())
data = json.loads(re.sub(r"(\w+):", r'"\1":', chunk) + "}")
...but this merely replaces one set of assumptions with another, and it's easy to concoct a scenario that breaks this pattern as well. If the use case is scraping millions of products, it's almost inevitable that something unanticipated will arise and the patterns shown here will need further adaptation. This post is a proof-of-concept and can't purport to parse arbitrary formats, so it's an exercise for the reader to make further adjustments.
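For a concrete look at the failure and the patch side by side, the sketch below runs both variants on a made-up fragment. Only the label value comes from OP's report; the other fields, their quoting, and the helper name quote_keys_and_load are assumptions for illustration.

```python
import json
import re

# Made-up fragment: a double-quoted value containing an apostrophe,
# next to an ordinary single-quoted value.
html = """<script>
__metadata.product = {
id: "W21-107341",
brand: 'Burton',
label: "Girls' Burton Chicklet Flat Top Snowboard"
};
</script>"""


def quote_keys_and_load(chunk):
    """Quote bare keys, restore the closing brace, and parse the result as JSON."""
    return json.loads(re.sub(r"(\w+):", r'"\1":', chunk) + "}")


raw = re.search(r"\{[^}]+", html).group()

# Naive variant: every apostrophe becomes a double quote, which cuts the label short.
try:
    quote_keys_and_load(raw.replace("'", '"'))
    naive_ok = True
except json.JSONDecodeError:
    naive_ok = False

# Lookahead variant: leave an apostrophe alone if a double quote follows on the same line.
data = quote_keys_and_load(re.sub(r'\'(?![^\n"]*")', '"', raw))

print(naive_ok)
print(data["label"])
```

The lookahead variant still fails on, for example, a single-quoted value whose line also contains a double quote, which is the kind of case the paragraph above warns about.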
Upvotes: 2