dirtyw0lf

Reputation: 1958

Extracting from script - beautiful soup

How can the value of "tier1Category" be extracted from the source of this page? https://www.walgreens.com/store/c/walgreens-wal-zyr-24-hour-allergy-tablets/ID=prod6205762-product

soup.find('script') 

returns only one part of the source, and the following returns a different script block from the same page, not the one I need:

json.loads(soup.find("script", type="application/ld+json").text)

Upvotes: 1

Views: 108

Answers (3)

gregory

Reputation: 12885

Bitto and I have similar approaches to this; however, I prefer not to rely on knowing which script contains the matching pattern, or on the structure of the script.

import json
import requests
from collections import abc
from bs4 import BeautifulSoup as bs

def nested_dict_iter(nested):
    # Recursively yield (key, value) pairs from nested mappings
    for key, value in nested.items():
        if isinstance(value, abc.Mapping):
            yield from nested_dict_iter(value)
        else:
            yield key, value

r = requests.get('https://www.walgreens.com/store/c/walgreens-wal-zyr-24-hour-allergy-tablets/ID=prod6205762-product')
soup = bs(r.content, 'lxml')
for script in soup.find_all('script'):
    if 'tier1Category' in script.text:
        # Slice from the first '{' to the last ';' to isolate the JSON text
        text = script.text
        j = json.loads(text[text.index('{'):text.rindex(';')])
        for k, v in nested_dict_iter(j):
            if k == 'tier1Category':
                print(v)
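
One caveat with the generator above: it only descends into nested mappings, so a key that sits inside a list of dicts would be skipped. A minimal sketch of a variant that also walks lists follows. The find_key name and the sample data are made up for illustration and are not taken from the actual page.

```python
from collections import abc

def find_key(node, target):
    """Yield every value stored under `target`, walking dicts and lists."""
    if isinstance(node, abc.Mapping):
        for key, value in node.items():
            if key == target:
                yield value
            # Keep descending in case the key also appears deeper down
            yield from find_key(value, target)
    elif isinstance(node, list):
        for item in node:
            yield from find_key(item, target)

# Hypothetical data, shaped loosely like a product payload
sample = {
    "product": {
        "results": [
            {"productInfo": {"tier1Category": "Medicines & Treatments"}},
        ]
    }
}

matches = list(find_key(sample, "tier1Category"))
print(matches)
```

Whether the real payload nests the key inside a list is not something I checked; the variant just avoids depending on it.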

Upvotes: 2

Bitto

Reputation: 8205

Here are the steps I used to get the output:

  • Use find_all and take the 10th script tag (index 9). This script tag contains the tier1Category value.

  • Take the script text from the first occurrence of { up to the last occurrence of ; . This gives valid JSON text.

  • Load the text using json.loads.

  • Inspect the structure of the JSON to find the path to the tier1Category value.

Code:

import json
import requests
from bs4 import BeautifulSoup

r = requests.get('https://www.walgreens.com/store/c/walgreens-wal-zyr-24-hour-allergy-tablets/ID=prod6205762-product')
soup = BeautifulSoup(r.text, 'html.parser')
# The 10th script tag holds the product data
script_text = soup.find_all('script')[9].text
# Keep everything from the first '{' up to (not including) the last ';'
start = script_text.index('{')
end = script_text.rindex(';')
proper_json_text = script_text[start:end]
our_json = json.loads(proper_json_text)
print(our_json['product']['results']['productInfo']['tier1Category'])

Output:

Medicines & Treatments
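
The slice between the first { and the last ; assumes the script holds a single assignment ending in a semicolon. As a hedged alternative (a different technique from the one used above), json.JSONDecoder.raw_decode parses one JSON value starting at a given index and ignores whatever follows it. The sketch below runs on a made-up script string; the variable name window.__STATE__ is an assumption, not what the page actually uses.

```python
import json

# Hypothetical script body; the real page's variable name and layout may differ
script_text = (
    'window.__STATE__ = {"product": {"results": {"productInfo": '
    '{"tier1Category": "Medicines & Treatments"}}}}; var other = 1;'
)

def extract_first_json_object(text):
    """Parse the first JSON object found in `text`, ignoring trailing code."""
    start = text.index('{')
    # raw_decode returns the parsed value and the index where parsing stopped
    obj, _end = json.JSONDecoder().raw_decode(text, start)
    return obj

data = extract_first_json_object(script_text)
category = data['product']['results']['productInfo']['tier1Category']
print(category)
```

This only helps if the first { in the script really opens the JSON object; if other code precedes it, the start index would need to be located differently.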

Upvotes: 2

QHarr

Reputation: 84455

I think you can use an id. I assume tier 1 is the entry after "shop" in the navigation tree. Otherwise, I don't see that value in the script[type="application/ld+json"] tag. I do see it in an ordinary script tag (one without type="application/ld+json"), but there are a lot of regex matches for tier 1.

import requests
from bs4 import BeautifulSoup as bs

r = requests.get('https://www.walgreens.com/store/c/walgreens-wal-zyr-24-hour-allergy-tablets/ID=prod6205762-product')
soup = bs(r.content, 'lxml')
data = soup.select_one("#bdCrumbDesktopUrls_0").text
print(data)

Upvotes: 0
