Reputation: 1958
How would the value for the "tier1Category" be extracted from the source of this page? https://www.walgreens.com/store/c/walgreens-wal-zyr-24-hour-allergy-tablets/ID=prod6205762-product
soup.find('script')
returns only part of the page source (the first script tag), and the following returns a different part of it, which does not contain that value:
json.loads(soup.find("script", type="application/ld+json").text)
Upvotes: 1
Views: 108
Reputation: 12885
Bitto and I have similar approaches to this; however, I prefer not to rely on knowing which script contains the matching pattern, or on the structure of the JSON inside that script.
import json
import requests
from collections import abc
from bs4 import BeautifulSoup as bs

def nested_dict_iter(nested):
    # Recursively yield (key, value) pairs from nested dictionaries.
    for key, value in nested.items():
        if isinstance(value, abc.Mapping):
            yield from nested_dict_iter(value)
        else:
            yield key, value

r = requests.get('https://www.walgreens.com/store/c/walgreens-wal-zyr-24-hour-allergy-tablets/ID=prod6205762-product')
soup = bs(r.content, 'lxml')

for script in soup.find_all('script'):
    if 'tier1Category' in script.text:
        # Keep the text from the first '{' up to (not including) the last ';'.
        text = script.text
        j = json.loads(text[text.index('{'):text.rindex(';')])
        for k, v in nested_dict_iter(j):
            if k == 'tier1Category':
                print(v)
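The generator above only descends into nested dictionaries, so a key stored inside a list of dictionaries would be missed. Below is a minimal sketch of a variant that also walks lists, assuming the script text has already been parsed with json.loads. The function name find_key and the sample data are hypothetical and only shaped loosely like the page data.

```python
from collections import abc


def find_key(node, target):
    """Yield every value stored under `target`, at any depth of dicts and lists."""
    if isinstance(node, abc.Mapping):
        for key, value in node.items():
            if key == target:
                yield value
            # Keep descending: the value itself may hold more matches.
            yield from find_key(value, target)
    elif isinstance(node, (list, tuple)):
        for item in node:
            yield from find_key(item, target)


# Hypothetical sample data (an assumption, not copied from the real page).
sample = {
    "product": {
        "results": {"productInfo": {"tier1Category": "Medicines & Treatments"}},
        "related": [{"tier1Category": "Vitamins"}],
    }
}

print(list(find_key(sample, "tier1Category")))
```

Because it yields all matches, the caller can decide whether to take the first one or inspect every occurrence.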
Upvotes: 2
Reputation: 8205
Here are the steps I used to get the output:
1. Use find_all and get the 10th script tag. This script tag contains the tier1Category value.
2. Take the script text from the first occurrence of '{' up to the last occurrence of ';'. This gives us valid JSON text.
3. Load the text using json.loads.
4. Understand the structure of the JSON to find how to get the tier1Category value.
Code:
import json
import requests
from bs4 import BeautifulSoup

r = requests.get('https://www.walgreens.com/store/c/walgreens-wal-zyr-24-hour-allergy-tablets/ID=prod6205762-product')
soup = BeautifulSoup(r.text, 'html.parser')
# The 10th script tag (index 9) held the product data when this was written.
script_text = soup.find_all('script')[9].text
# Slice from the first '{' up to (not including) the last ';' to get valid JSON.
start = script_text.index('{')
end = script_text.rindex(';')
proper_json_text = script_text[start:end]
our_json = json.loads(proper_json_text)
print(our_json['product']['results']['productInfo']['tier1Category'])
Output:
Medicines & Treatments
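Slicing up to the last ';' works as long as the script is a single assignment. Here is a minimal sketch of an alternative that lets the JSON parser find the end of the object itself, using json.JSONDecoder.raw_decode from the standard library. The helper name, the variable name window.__APP_STATE__ and the script text are made up for illustration and are not taken from the Walgreens page.

```python
import json


def extract_first_json_object(script_text):
    """Parse the first JSON object embedded in a JavaScript snippet."""
    start = script_text.index("{")  # raises ValueError if no '{' is present
    # raw_decode parses one JSON value starting at `start` and ignores
    # whatever follows it (a trailing ';', more statements, and so on).
    obj, _end = json.JSONDecoder().raw_decode(script_text, start)
    return obj


# Hypothetical script body (an assumption); the real page's text will differ.
script_text = (
    'window.__APP_STATE__ = {"product": {"results": {"productInfo": '
    '{"tier1Category": "Medicines & Treatments"}}}}; console.log("done");'
)

data = extract_first_json_object(script_text)
print(data["product"]["results"]["productInfo"]["tier1Category"])
```

This still assumes the first '{' in the script starts valid JSON; a JavaScript object literal with unquoted keys would raise json.JSONDecodeError.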
Upvotes: 2
Reputation: 84455
I think you can use an id. I assume tier 1 is the entry after "Shop" in the navigation tree. Otherwise, I don't see that value in the script[type="application/ld+json"] tag. I do see it in an ordinary script tag (one without type="application/ld+json"), but a regex for tier 1 gets a lot of matches there.
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://www.walgreens.com/store/c/walgreens-wal-zyr-24-hour-allergy-tablets/ID=prod6205762-product')
soup = bs(r.content, 'lxml')
data = soup.select_one("#bdCrumbDesktopUrls_0").text
print(data)
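Note that select_one returns None when nothing matches, so calling .text directly fails if the breadcrumb is missing or the page is rendered differently. Below is a minimal sketch with a guard, run against a made-up HTML fragment. Only the id bdCrumbDesktopUrls_0 comes from the code above; the surrounding markup and the helper name breadcrumb_text are assumptions.

```python
from bs4 import BeautifulSoup


def breadcrumb_text(html, css_id="bdCrumbDesktopUrls_0"):
    """Return the stripped text of the element with `css_id`, or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    node = soup.select_one(f"#{css_id}")
    return node.get_text(strip=True) if node is not None else None


# Hypothetical markup (an assumption); the real breadcrumb structure may differ.
html = (
    '<ul><li><a id="bdCrumbDesktopUrls_0" href="#">'
    " Medicines &amp; Treatments </a></li></ul>"
)

print(breadcrumb_text(html))
print(breadcrumb_text("<p>no breadcrumb here</p>"))
```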
Upvotes: 0