Reputation: 629

Should I use regex to grab a JSON value from HTML?

I have been getting mixed answers here on whether or not to use regex.

What I am trying to do is grab a specific value (the JSON of spConfig) from the HTML, which is:

<script type="text/x-magento-init">
        {
            "#product_addtocart_form": {
                "configurable": {
                    "spConfig": {"attributes":{"93":{"id":"93","code":"color","label":"Color","options":[{"id":"8243","label":"Helloworld","products":["97460","97459"]}],"position":"0"},"148":{"id":"148","code":"codish","label":"Codish","options":[{"id":"4707","label":"12.5","products":[]},{"id":"2724","label":"13","products":[]},{"id":"4708","label":"13.5","products":[]}],"position":"1"}},"template":"EUR <%- data.price %>","optionPrices":{"97459":{"oldPrice":{"amount":121},"basePrice":{"amount":121},"finalPrice":{"amount":121},"tierPrices":[]}},"prices":{"oldPrice":{"amount":"121"},"basePrice":{"amount":"121"},"finalPrice":{"amount":"121"}},"productId":"97468","chooseText":"Choose an Option...","images":[],"index":[]},
                    "gallerySwitchStrategy": "replace"
                }
            }
        }
    </script>

Here is the problem: when scraping the HTML, there are multiple <script type="text/x-magento-init"> tags but only one spConfig, and I have two questions.

Should I grab the spConfig value using regex and then call json.loads(spConfigValue) on it? If not, what method should I use to scrape the JSON value?
If regex is the way to go: I have been trying to grab it using \"spConfig\"\: (.*?) but it does not capture the JSON value. What am I doing wrong?

Upvotes: 1

Answers (3)

QHarr

Reputation: 84465

In this case, with bs4 4.7.1+, the :contains pseudo-class is your friend (newer Soup Sieve releases spell it :-soup-contains and deprecate the old name). You say only a single script contains spConfig, so you can do the following:

from bs4 import BeautifulSoup as bs
import json

html = '''<html>
 <head>
  <script type="text/x-magento-init">
        {
            "#product_addtocart_form": {
                "configurable": {
                    "spConfig": {"attributes":{"93":{"id":"93","code":"color","label":"Color","options":[{"id":"8243","label":"Helloworld","products":["97460","97459"]}],"position":"0"},"148":{"id":"148","code":"codish","label":"Codish","options":[{"id":"4707","label":"12.5","products":[]},{"id":"2724","label":"13","products":[]},{"id":"4708","label":"13.5","products":[]}],"position":"1"}},"template":"EUR <%- data.price %>","optionPrices":{"97459":{"oldPrice":{"amount":121},"basePrice":{"amount":121},"finalPrice":{"amount":121},"tierPrices":[]}},"prices":{"oldPrice":{"amount":"121"},"basePrice":{"amount":"121"},"finalPrice":{"amount":"121"}},"productId":"97468","chooseText":"Choose an Option...","images":[],"index":[]},
                    "gallerySwitchStrategy": "replace"
                }
            }
        }
    </script>
 </head>
 <body></body>
</html>'''
soup = bs(html, 'html.parser')
# Select the one <script> whose text contains "spConfig" and parse its whole body as JSON
data = json.loads(soup.select_one('script:contains(spConfig)').text)

Config is then:

data['#product_addtocart_form']['configurable']['spConfig']

with keys: attributes, template, optionPrices, prices, productId, chooseText, images, index
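
If the selector key ("#product_addtocart_form") or the nesting differs from page to page, the hard-coded path could be swapped for a recursive lookup. This is only a sketch: find_key is a hypothetical helper, not part of bs4 or json, and the data below is a trimmed stand-in for the parsed script body from the question.

```python
import json


def find_key(node, key):
    """Return the first value stored under `key` in nested dicts/lists, else None.

    Note: a JSON null stored under `key` is indistinguishable from "not found".
    """
    if isinstance(node, dict):
        if key in node:
            return node[key]
        children = node.values()
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_key(child, key)
        if found is not None:
            return found
    return None


# Trimmed stand-in for the JSON inside the <script> tag
data = json.loads('''
{
    "#product_addtocart_form": {
        "configurable": {
            "spConfig": {"productId": "97468", "chooseText": "Choose an Option...", "images": [], "index": []},
            "gallerySwitchStrategy": "replace"
        }
    }
}
''')

sp_config = find_key(data, "spConfig")
print(sorted(sp_config))  # the key names of spConfig
```

The explicit path is still the better choice when the structure is known, since it fails loudly if the page layout changes.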

Upvotes: 1

elhay efrat

Reputation: 167

Basically, you use a JSON parser for JSON and a YAML parser for YAML, so for HTML use an HTML parser. Here is an example from the Python documentation; working this way will make your life much easier:

from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    # feed() calls each handler as the matching piece of markup is encountered
    def handle_starttag(self, tag, attrs):
        print("Encountered a start tag:", tag)

    def handle_endtag(self, tag):
        print("Encountered an end tag :", tag)

    def handle_data(self, data):
        print("Encountered some data  :", data)

parser = MyHTMLParser()
parser.feed('<html><head><title>Test</title></head>'
            '<body><h1>Parse me!</h1></body></html>')

https://docs.python.org/3/library/html.parser.html
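
Applied to the question, the same standard-library parser could collect the body of every <script type="text/x-magento-init"> tag and then pick the one whose JSON holds spConfig. The following is a rough sketch under a few assumptions: the class and function names are made up for illustration, the payload shape {selector: {component: {...}}} is inferred from the question's snippet, and "otherComponent" in the sample is a hypothetical second script.

```python
import json
from html.parser import HTMLParser


class MagentoInitParser(HTMLParser):
    """Collect the text of every <script type="text/x-magento-init"> element."""

    def __init__(self):
        super().__init__()
        self._in_target = False
        self._chunks = []
        self.scripts = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "text/x-magento-init":
            self._in_target = True
            self._chunks = []

    def handle_data(self, data):
        # Script text can arrive in several pieces, so buffer it
        if self._in_target:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_target:
            self.scripts.append("".join(self._chunks))
            self._in_target = False


def extract_sp_config(html):
    """Return the spConfig dict from the first script that has one, else None."""
    parser = MagentoInitParser()
    parser.feed(html)
    parser.close()
    for body in parser.scripts:
        try:
            payload = json.loads(body)
        except json.JSONDecodeError:
            continue  # not valid JSON, skip this script
        if not isinstance(payload, dict):
            continue
        for components in payload.values():
            if not isinstance(components, dict):
                continue
            for settings in components.values():
                if isinstance(settings, dict) and "spConfig" in settings:
                    return settings["spConfig"]
    return None


sample = '''<html><head>
<script type="text/x-magento-init">{"*": {"otherComponent": {"enabled": true}}}</script>
<script type="text/x-magento-init">
{"#product_addtocart_form": {"configurable": {
    "spConfig": {"template": "EUR <%- data.price %>", "productId": "97468"},
    "gallerySwitchStrategy": "replace"}}}
</script>
</head><body></body></html>'''

sp_config = extract_sp_config(sample)
print(sp_config["productId"])
```

Parsing each script with json.loads, rather than searching its text, means the spConfig value comes back as a normal dict with no string slicing involved.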

Upvotes: 0

csabinho

Reputation: 1609

No, don't ever use regex for HTML. Use an HTML parser such as BeautifulSoup instead!
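
As for why the pattern in the question captures nothing: a lazy group such as (.*?) with nothing after it is satisfied by zero characters, so it matches the empty string. If a plain string search is used anyway, one option is to let the JSON decoder decide where the value ends instead of the regex. This is a sketch, not a recommendation over an HTML parser; decode_value_after_key is a hypothetical helper and the text is a shortened version of the question's snippet.

```python
import json
import re

text = '''<script type="text/x-magento-init">
{"#product_addtocart_form": {"configurable": {
    "spConfig": {"productId": "97468", "images": [], "prices": {"finalPrice": {"amount": "121"}}},
    "gallerySwitchStrategy": "replace"}}}
</script>'''

# The pattern from the question: the lazy group stops as early as possible,
# and with nothing after it that means after zero characters.
match = re.search(r'\"spConfig\"\: (.*?)', text)
captured = match.group(1)
print(repr(captured))


def decode_value_after_key(text, key):
    """Decode the JSON value that follows `"key":` in text, or return None if absent."""
    located = re.search(r'"%s"\s*:\s*' % re.escape(key), text)
    if located is None:
        return None
    # raw_decode parses one complete JSON value starting at the given index
    # and ignores whatever follows it.
    value, _end = json.JSONDecoder().raw_decode(text, located.end())
    return value


sp_config = decode_value_after_key(text, "spConfig")
print(sp_config["productId"])
```

The regex here only locates the key; nested braces and quoted strings inside the value are handled by the decoder, which is the part a regex cannot do reliably.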

Upvotes: 1
