Decoding HTML character entities in JSON

Question

When I do this:

s = response.xpath('//meta[@id="_bootstrap-neighborhood_card"]').extract()

what I get back is:

This is clearly JSON, but it's encoded (as you can see). I tried urllib.unquote, but that throws an error: AttributeError: 'list' object has no attribute 'split'

I was hoping to not have to resort to using a regex to do the URL decoding. What can I do (besides using a regex) to make this valid JSON?

mhawke · Accepted Answer

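You can decode it using json.loads(); however, you first need to get at the JSON string contained in the content attribute of the meta tag. Note that extract() returns a list of strings rather than a single string, which is why urllib.unquote raised the AttributeError. The value is also HTML-entity encoded rather than URL encoded, and the selector decodes the entities when it reads the attribute, so no URL decoding is needed.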

You can make multiple calls to xpath() to drill into the attributes of the selected tag:

import json

meta = response.xpath('//meta[@id="_bootstrap-neighborhood_card"]')
content = meta.xpath('@content').extract_first()
data = json.loads(content)

Or you can do it in one go:

import json
from pprint import pprint

content = response.xpath('//meta[@id="_bootstrap-neighborhood_card"]').xpath('@content').extract_first()
data = json.loads(content)
pprint(data)

Output

{u'hosting': {u'id': 2256573,
              u'offset_lat': 39.04258923718809,
              u'offset_lng': -95.69083697887662},
 u'map_url': u'https://maps.googleapis.com/maps/api/staticmap?markers=%2C&size&zoom=14',
 u'neighborhood_basic_info': None,
 u'neighborhood_breadcrumb_details': [{u'link': u'Southwest Fillmore Street,',
                                       u'link_route': u'/s/Southwest-Fillmore-Street-Topeka--KS',
                                       u'link_text': u'Southwest Fillmore Street,',
                                       u'search_text': u'Southwest Fillmore Street Topeka, KS'},
                                      {u'link': u'Topeka,',
                                       u'link_route': u'/s/Topeka--KS',
                                       u'link_text': u'Topeka,',
                                       u'search_text': u'Topeka, KS'},
                                      {u'link': u'Kansas,',
                                       u'link_route': u'/s/Kansas--United-States',
                                       u'link_text': u'Kansas,',
                                       u'search_text': u'Kansas, United States'},
                                      {u'link': u'United States',
                                       u'link_route': u'/s/United-States',
                                       u'link_text': u'United States',
                                       u'search_text': u'United States'}],
 u'neighborhood_localized_name': None,
 u'place_recommendations': [],
 u'user_info': {u'user_image': u''}}
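
If you want to try the same idea outside of a Scrapy callback, here is a minimal sketch using only the Python 3 standard library. The SAMPLE_HTML string, the MetaContentParser class and the load_meta_json helper are made up for illustration; the real page's markup is not shown in the question, so the sample only assumes that the JSON sits entity-encoded in the content attribute. It relies on html.parser.HTMLParser replacing entity references in attribute values before handing them to handle_starttag.

```python
import json
from html.parser import HTMLParser

# Hypothetical sample markup: the real content attribute is not shown in the
# question, so this only mimics JSON stored with &quot; entities.
SAMPLE_HTML = (
    '<meta id="_bootstrap-neighborhood_card" '
    'content="{&quot;hosting&quot;: {&quot;id&quot;: 2256573}, '
    '&quot;place_recommendations&quot;: []}">'
)


class MetaContentParser(HTMLParser):
    """Collect the content attribute of the <meta> tag with a given id."""

    def __init__(self, meta_id):
        super().__init__()
        self.meta_id = meta_id
        self.content = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and attributes.get("id") == self.meta_id:
            # Entity references such as &quot; are already replaced here.
            self.content = attributes.get("content")


def load_meta_json(html_text, meta_id):
    """Return the parsed JSON stored in the matching meta tag, or None."""
    parser = MetaContentParser(meta_id)
    parser.feed(html_text)
    parser.close()
    if parser.content is None:
        return None
    return json.loads(parser.content)


data = load_meta_json(SAMPLE_HTML, "_bootstrap-neighborhood_card")
print(data["hosting"]["id"])
```

The point is the same as in the Scrapy version: once an HTML parser has read the attribute, the string is already plain JSON, so json.loads() is the only decoding step left.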
