merlin

Reputation: 2927

How to pull JSON data from HTML into a Python dictionary?

I am trying to extract structured data from a JSON-like object embedded in a script tag of an HTML page. I retrieved the HTML, selected the script content via XPath, and tried to parse it as JSON:

json.loads(response.xpath('//*[@id="product"]/script[2]/text()').extract_first())

The data starts like this:

response.xpath('//*[@id="product"]/script[2]/text()').extract_first()
"\r\ndataLayer.push({\r\n\t'event': 'EECproductDetailView',\r\n\t'ecommerce': {\r\n\t\t'detail': {\r\n\r\n\t\t\t'products': [{\r\n\t\t\t\t'id': '14171171',\r\n\t\t\t\t'name': 'Gingium 120mg',\r\n\t\t\t\t'price': '27.9',\r\n\r\n\t\t\t\t'brand': 'Hexal AG',\r\n\r\n\r\n\t\t\t\t'variant': 'Filmtabletten, 60 Stück, N2',\r\n\r\n\r\n\t\t\t\t'category': 'gedaechtnis-konzentration'\r\n\t\t\t}]\r\n\t\t}\r\n\t}\r\n});\r\n"

A simplified sample of the script, formatted for readability:

<script>
dataLayer.push({
    'event': 'EECproductDetailView',
    'ecommerce': {
        'detail': {

            'products': [{
                'id': '14122171',
                'name': 'test',
                'price': '27.9'
            }]
        }
    }
});
</script>

The error message is:

>>> json.loads(response.xpath('//*[@id="product"]/script[2]/text()').extract_first())
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "/usr/local/Cellar/python/3.7.1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/json/__init__.py", line 348, in loads
    return _default_decoder.decode(s)
  File "/usr/local/Cellar/python/3.7.1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/local/Cellar/python/3.7.1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 2 column 1 (char 2)

I also tried to decode:

>>> json.loads(response.xpath('//*[@id="product"]/script[2]/text()').extract_first().decode("utf-8"))
Traceback (most recent call last):
  File "<console>", line 1, in <module>
AttributeError: 'str' object has no attribute 'decode'
>>>

How can I pull the product data into a python dictionary?

Upvotes: 0

Views: 324

Answers (1)

Amin Rezaei

Reputation: 376

There are a couple of issues with your approach, discussed below. You want to parse the argument passed to dataLayer.push as JSON, and this is your input:

dataLayer.push({
    'event': 'EECproductDetailView',
    'ecommerce': {
        'detail': {

            'products': [{
                'id': '14122171',
                'name': 'test',
                'price': '27.9'
            }]
        }
    }
});

Issues:

  1. This is raw JavaScript, not JSON, so you shouldn't pass it directly to json.loads. First extract the {'event' .... } object from the string, using a regex or plain string slicing. For example, if the data always has this format and the script contains no other {} blocks, take the index of the first { and of the last } and slice out the substring between them (inclusive).
  2. The data uses ' as the string delimiter, but the JSON standard requires double quotes ". You need to replace them as well.

After resolving issues you can use json.loads to parse your input.
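
The two steps above can be sketched roughly as follows. This is a minimal sketch, not a robust parser: the helper name extract_push_payload and the hard-coded raw string are made up for illustration (in your case raw would be the result of extract_first()), and it assumes the script contains exactly one object literal and that no value contains an apostrophe or a double quote.

```python
import json

# Stand-in for the string returned by extract_first() (assumed sample data).
raw = """
dataLayer.push({
    'event': 'EECproductDetailView',
    'ecommerce': {
        'detail': {
            'products': [{
                'id': '14122171',
                'name': 'test',
                'price': '27.9'
            }]
        }
    }
});
"""


def extract_push_payload(script_text):
    """Hypothetical helper: slice out the object literal and parse it."""
    # Step 1: keep only the text from the first "{" to the last "}".
    start = script_text.index("{")
    end = script_text.rindex("}")
    payload = script_text[start:end + 1]
    # Step 2: JSON requires double-quoted strings.
    # Naive replacement: breaks if a value itself contains ' or ".
    payload = payload.replace("'", '"')
    return json.loads(payload)


data = extract_push_payload(raw)
product = data["ecommerce"]["detail"]["products"][0]
print(product["name"], product["price"])
```

If the values can contain apostrophes, the blanket replace will corrupt them. Since this particular object also happens to be a valid Python literal, passing the sliced substring to ast.literal_eval instead of replacing quotes may be an alternative, though that would stop working as soon as the script uses JavaScript-only values such as true or null.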

Upvotes: 1
