Reputation: 2927
I am trying to extract structured data from a JSON-like object embedded in a script element of an HTML page. I retrieved the HTML, selected the script content via XPath, and passed it to json.loads:
json.loads(response.xpath('//*[@id="product"]/script[2]/text()').extract_first())
The data starts like this:
response.xpath('//*[@id="product"]/script[2]/text()').extract_first()
"\r\ndataLayer.push({\r\n\t'event': 'EECproductDetailView',\r\n\t'ecommerce': {\r\n\t\t'detail': {\r\n\r\n\t\t\t'products': [{\r\n\t\t\t\t'id': '14171171',\r\n\t\t\t\t'name': 'Gingium 120mg',\r\n\t\t\t\t'price': '27.9',\r\n\r\n\t\t\t\t'brand': 'Hexal AG',\r\n\r\n\r\n\t\t\t\t'variant': 'Filmtabletten, 60 Stück, N2',\r\n\r\n\r\n\t\t\t\t'category': 'gedaechtnis-konzentration'\r\n\t\t\t}]\r\n\t\t}\r\n\t}\r\n});\r\n"
Sample structured json:
<script>
dataLayer.push({
'event': 'EECproductDetailView',
'ecommerce': {
'detail': {
'products': [{
'id': '14122171',
'name': 'test',
'price': '27.9'
}]
}
}
});
</script>
The error message is:
>>> json.loads(response.xpath('//*[@id="product"]/script[2]/text()').extract_first())
Traceback (most recent call last):
File "<console>", line 1, in <module>
File "/usr/local/Cellar/python/3.7.1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/json/__init__.py", line 348, in loads
return _default_decoder.decode(s)
File "/usr/local/Cellar/python/3.7.1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/json/decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/local/Cellar/python/3.7.1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/json/decoder.py", line 355, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 2 column 1 (char 2)
I also tried to decode:
>>> json.loads(response.xpath('//*[@id="product"]/script[2]/text()').extract_first().decode("utf-8"))
Traceback (most recent call last):
File "<console>", line 1, in <module>
AttributeError: 'str' object has no attribute 'decode'
>>>
How can I pull the product data into a Python dictionary?
Upvotes: 0
Views: 324
Reputation: 376
There are several issues with your approach, which I discuss below. You want to parse the value passed to the push function as JSON, and this is your input:
dataLayer.push({
'event': 'EECproductDetailView',
'ecommerce': {
'detail': {
'products': [{
'id': '14122171',
'name': 'test',
'price': '27.9'
}]
}
}
});
Issues:
1. The string starts with dataLayer.push( and ends with );, so as a whole it is a JavaScript function call, not JSON, and json.loads rejects it. To resolve this, grab the {'event' .... } part from your string with a regex or some string slicing. For example, if your data always has this format and the script contains no other {} blocks, take the index of the first { and the last } and use the substring between them as the main data.
2. The data uses single quotes ' as string delimiters, but the JSON standard requires double quotes ". You need to replace them as well.
After resolving both issues you can use json.loads to parse your input.
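The two steps described above can be sketched in Python roughly as follows. This is only a minimal illustration: the helper name extract_push_payload and the sample string are made up for the example, and the quote swap assumes that no key or value contains an apostrophe or a double quote.

```python
import json


def extract_push_payload(script_text):
    """Return the object passed to dataLayer.push(...) as a dict.

    Hypothetical helper: assumes the script contains exactly one
    top-level {...} block and that no key or value contains ' or ".
    """
    start = script_text.find("{")   # first opening brace
    end = script_text.rfind("}")    # last closing brace
    if start == -1 or end < start:
        raise ValueError("no object literal found in script text")
    literal = script_text[start:end + 1]
    # Naive swap of single quotes for the double quotes JSON requires.
    return json.loads(literal.replace("'", '"'))


# Sample input shaped like the script content in the question.
sample = """
dataLayer.push({
    'event': 'EECproductDetailView',
    'ecommerce': {
        'detail': {
            'products': [{
                'id': '14122171',
                'name': 'test',
                'price': '27.9'
            }]
        }
    }
});
"""

data = extract_push_payload(sample)
product = data["ecommerce"]["detail"]["products"][0]
print(product["id"], product["name"], product["price"])
```

In the Scrapy shell you would pass the result of response.xpath(...).extract_first() instead of sample. If a product name may contain an apostrophe, the quote swap breaks; since this particular object literal is also a valid Python dict literal, ast.literal_eval on the sliced substring could be an alternative worth trying.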
Upvotes: 1