Reputation: 629
I have been getting mixed answers here about whether to use regex or not.
What I am trying to do is grab a specific value (the JSON of spConfig) from the HTML, which is:
<script type="text/x-magento-init">
{
"#product_addtocart_form": {
"configurable": {
"spConfig": {"attributes":{"93":{"id":"93","code":"color","label":"Color","options":[{"id":"8243","label":"Helloworld","products":["97460","97459"]}],"position":"0"},"148":{"id":"148","code":"codish","label":"Codish","options":[{"id":"4707","label":"12.5","products":[]},{"id":"2724","label":"13","products":[]},{"id":"4708","label":"13.5","products":[]}],"position":"1"}},"template":"EUR <%- data.price %>","optionPrices":{"97459":{"oldPrice":{"amount":121},"basePrice":{"amount":121},"finalPrice":{"amount":121},"tierPrices":[]}},"prices":{"oldPrice":{"amount":"121"},"basePrice":{"amount":"121"},"finalPrice":{"amount":"121"}},"productId":"97468","chooseText":"Choose an Option...","images":[],"index":[]},
"gallerySwitchStrategy": "replace"
}
}
}
</script>
Here is the problem: when scraping the HTML, there are multiple <script type="text/x-magento-init"> tags
but only one spConfig.
I have two questions:
Should I grab the spConfig value using regex and then pass it to json.loads(spConfigValue), or not? If not, what method should I use to scrape the JSON value?
If I am supposed to use regex: I have been trying to grab it using \"spConfig\"\: (.*?)
However, it does not capture the JSON value for me. What am I doing wrong?
Upvotes: 1
Views: 186
Reputation: 84465
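In this case, with bs4 4.7.1+, the :contains pseudo-class is your friend. You say only one script contains spConfig, so you can select that script directly and parse its text as JSON (newer Soup Sieve releases spell the same selector :-soup-contains):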
from bs4 import BeautifulSoup as bs
import json
html= '''<html>
<head>
<script type="text/x-magento-init">
{
"#product_addtocart_form": {
"configurable": {
"spConfig": {"attributes":{"93":{"id":"93","code":"color","label":"Color","options":[{"id":"8243","label":"Helloworld","products":["97460","97459"]}],"position":"0"},"148":{"id":"148","code":"codish","label":"Codish","options":[{"id":"4707","label":"12.5","products":[]},{"id":"2724","label":"13","products":[]},{"id":"4708","label":"13.5","products":[]}],"position":"1"}},"template":"EUR <%- data.price %>","optionPrices":{"97459":{"oldPrice":{"amount":121},"basePrice":{"amount":121},"finalPrice":{"amount":121},"tierPrices":[]}},"prices":{"oldPrice":{"amount":"121"},"basePrice":{"amount":"121"},"finalPrice":{"amount":"121"}},"productId":"97468","chooseText":"Choose an Option...","images":[],"index":[]},
"gallerySwitchStrategy": "replace"
}
}
}
</script>
</head>
<body></body>
</html>'''
# Parse the page, then pick the single script tag whose text mentions spConfig
soup = bs(html, 'html.parser')
data = json.loads(soup.select_one('script:contains(spConfig)').text)
The config is then:
data['#product_addtocart_form']['configurable']['spConfig']
with keys: attributes, template, optionPrices, prices, productId, chooseText, images, index
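If the :contains selector is not available in your bs4 version, or you do not want to hard-code the "#product_addtocart_form" key, a rough alternative is to loop over every magento-init script and check each parsed payload. This is only a sketch: the shortened sample page, the "exampleWidget" block and the helper name find_sp_config are made up for illustration.

```python
import json
from bs4 import BeautifulSoup

# Shortened, made-up sample page: two magento-init scripts, only one has spConfig.
html = """<html><head>
<script type="text/x-magento-init">
{"*": {"exampleWidget": {"enabled": true}}}
</script>
<script type="text/x-magento-init">
{"#product_addtocart_form": {"configurable": {
    "spConfig": {"productId": "97468", "chooseText": "Choose an Option..."},
    "gallerySwitchStrategy": "replace"}}}
</script>
</head><body></body></html>"""


def find_sp_config(page):
    """Hypothetical helper: return the first spConfig dict in the page, or None."""
    soup = BeautifulSoup(page, "html.parser")
    for script in soup.find_all("script", attrs={"type": "text/x-magento-init"}):
        try:
            payload = json.loads(script.string or "")
        except json.JSONDecodeError:
            continue  # this block is not valid JSON, skip it
        if not isinstance(payload, dict):
            continue
        # Each top-level key is a selector; its value maps widget names to options.
        for widget in payload.values():
            configurable = widget.get("configurable") if isinstance(widget, dict) else None
            if isinstance(configurable, dict) and "spConfig" in configurable:
                return configurable["spConfig"]
    return None


sp_config = find_sp_config(html)
print(sp_config["productId"])
```

The upside is that a page with no spConfig at all gives you None instead of an AttributeError from select_one returning nothing.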
Upvotes: 1
Reputation: 167
Basically: for JSON use a JSON parser, for YAML use a YAML parser, and for HTML use an HTML parser. Here is an example from the Python documentation that shows how the standard-library parser works:
from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("Encountered a start tag:", tag)

    def handle_endtag(self, tag):
        print("Encountered an end tag :", tag)

    def handle_data(self, data):
        print("Encountered some data :", data)

parser = MyHTMLParser()
parser.feed('<html><head><title>Test</title></head>'
            '<body><h1>Parse me!</h1></body></html>')
https://docs.python.org/3/library/html.parser.html
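Applied to the question, the same standard-library parser could collect the text of each magento-init script and hand it to json.loads. What follows is a minimal sketch, not a tested scraper: the class name MagentoInitParser, the helper extract_sp_config, the shortened sample page and its "exampleWidget" block are all invented for illustration.

```python
import json
from html.parser import HTMLParser


class MagentoInitParser(HTMLParser):
    """Collects the text of every <script type="text/x-magento-init"> block."""

    def __init__(self):
        super().__init__()
        self._in_target = False
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "text/x-magento-init":
            self._in_target = True
            self.blocks.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_target = False

    def handle_data(self, data):
        # Script text may arrive in several chunks, so append to the open block.
        if self._in_target:
            self.blocks[-1] += data


def extract_sp_config(page):
    """Return the first spConfig dict found in the page, or None."""
    parser = MagentoInitParser()
    parser.feed(page)
    parser.close()
    for block in parser.blocks:
        try:
            payload = json.loads(block)
        except json.JSONDecodeError:
            continue  # skip blocks that are not valid JSON
        if not isinstance(payload, dict):
            continue
        for widget in payload.values():
            configurable = widget.get("configurable") if isinstance(widget, dict) else None
            if isinstance(configurable, dict) and "spConfig" in configurable:
                return configurable["spConfig"]
    return None


# Shortened, made-up sample page with two script blocks.
html = """<html><head>
<script type="text/x-magento-init">
{"*": {"exampleWidget": {"enabled": true}}}
</script>
<script type="text/x-magento-init">
{"#product_addtocart_form": {"configurable": {
    "spConfig": {"productId": "97468", "chooseText": "Choose an Option..."},
    "gallerySwitchStrategy": "replace"}}}
</script>
</head><body></body></html>"""

print(extract_sp_config(html)["productId"])
```

This needs nothing outside the standard library, at the cost of more code than a BeautifulSoup selector.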
Upvotes: 0
Reputation: 1609
No, don't ever use regex for HTML. Use HTML parsers like BeautifulSoup instead!
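As for why the pattern in the question captures nothing: a lazy group like (.*?) with nothing after it is satisfied by the empty string, so the match stops right after the key. The sketch below shows that, and then one possible way to read the value once an HTML parser has already isolated the script text: locate the key and let json.JSONDecoder.raw_decode consume exactly one JSON value from that position. The script_text sample is shortened and made up for illustration.

```python
import json
import re

# Shortened stand-in for the text of the script tag (already extracted with a parser).
script_text = '''
{
    "#product_addtocart_form": {
        "configurable": {
            "spConfig": {"productId": "97468", "images": [], "prices": {"finalPrice": {"amount": "121"}}},
            "gallerySwitchStrategy": "replace"
        }
    }
}
'''

# The lazy group has nothing after it, so it is allowed to match zero characters.
lazy = re.search(r'"spConfig": (.*?)', script_text)
print(repr(lazy.group(1)))

# Find where the value starts, then let the JSON decoder read one complete value.
key = re.search(r'"spConfig"\s*:\s*', script_text)
sp_config, end = json.JSONDecoder().raw_decode(script_text, key.end())
print(sp_config["productId"])
```

Note that raw_decode does not skip leading whitespace, which is why the pattern consumes the spaces after the colon. When the whole script is valid JSON, as it is here, plain json.loads on the full text is simpler.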
Upvotes: 1