Chester Mc Allister

Reputation: 437

JSON value is an HTML string - how to parse it in Python?

I have a JSON file like this:

{
    "entryLabel": "cat",
    "entryContent": "<div class=\"entry_container\"><div class=\"entry lang_en-gb\" id=\"cat_1\"><span class=\"inline\"><h1 class=\"hwd\">cat<\/h1><span> [<\/span><span class=\"pron\" type=\"\">ˈkæt<a href=\"#\" class=\"playback\"><img src=\"https://api.collinsdictionary.com/external/images/redspeaker.gif?version=2013-10-30-1535\" alt=\"Pronunciation for cat\" class=\"sound\" title=\"Pronunciation for cat\" style=\"cursor: pointer\"/><\/a><audio type=\"pronunciation\" title=\"cat\"><source type=\"audio/mpeg\" src=\"https://api.collinsdictionary.com/media/sounds/sounds/0/081/08189/08189.mp3\"/>Your browser does not support HTML5 audio.<\/audio><\/span><span>]<\/span><\/span><div class=\"hom\" id=\"cat_1.1\"><span>   <\/span><span class=\"gramGrp\"><span class=\"pos\">noun<\/span><\/span><div class=\"sense\"><span>   <\/span><span class=\"bold\">1 <\/span><span class=\"lbl\"><span>(<\/span>domestic<span>)<\/span><\/span><span> <\/span><span class=\"cit lang_fr\"><span class=\"quote\">chat <em class=\"hi\">m<\/em><\/span><\/span><span class=\"cit\" id=\"cat_1.2\"><span>;   <\/span><span class=\"quote\">Have you got a cat?<\/span><span> <\/span><span class=\"cit lang_fr\"><span class=\"quote\">Est-ce que tu as un chat?<\/span><\/span><\/span><span class=\"re\" id=\"cat_1.3\"><span>;   <\/span><span class=\"inline\"><span class=\"orth\">to let the cat out of the bag<\/span><\/span><div class=\"sense\"><span> <\/span><span class=\"cit lang_fr\"><span class=\"quote\">vendre la mèche<\/span><\/span><\/div><!-- End of DIV sense--><\/span><span class=\"re\" id=\"cat_1.4\"><span>;   <\/span><span class=\"inline\"><span class=\"orth\">curiosity killed the cat<\/span><\/span><div class=\"sense\"><span> <\/span><span class=\"cit lang_fr\"><span class=\"quote\">la curiosité est toujours punie<\/span><\/span><\/div><!-- End of DIV sense--><\/span><span class=\"re\" id=\"cat_1.5\"><span>;   <\/span><span class=\"inline\"><span class=\"orth\">to look like sth the cat dragged 
in<\/span><\/span><span class=\"inline\"><span>, <\/span><span class=\"orth\">to look like sth the cat brought in<\/span><\/span><div class=\"sense\"><span> <\/span><span class=\"cit lang_fr\"><span class=\"quote\">être dans un état lamentable<\/span><\/span><\/div><!-- End of DIV sense--><\/span><span class=\"re\" id=\"cat_1.6\"><span>;   <\/span><span class=\"inline\"><span class=\"orth\">to play cat and mouse with sb<\/span><\/span><span class=\"inline\"><span>, <\/span><span class=\"orth\">to play a game of cat and mouse with sb<\/span><\/span><div class=\"sense\"><span> <\/span><span class=\"cit lang_fr\"><span class=\"quote\">jouer au chat et à la souris avec qn<\/span><\/span><\/div><!-- End of DIV sense--><\/span><span class=\"re\" id=\"cat_1.7\"><span>;   <\/span><span class=\"inline\"><span class=\"orth\">to put the cat among the pigeons<\/span><\/span><span class=\"inline\"><span>, <\/span><span class=\"orth\">to set the cat among the pigeons<\/span><\/span><span class=\"lbl\"><span> (<\/span>British<span>)<\/span><\/span><div class=\"sense\"><span> <\/span><span class=\"cit lang_fr\"><span class=\"quote\">jeter un pavé dans la mare<\/span><\/span><\/div><!-- End of DIV sense--><\/span><span class=\"re\" id=\"cat_1.8\"><span>;   <\/span><span class=\"inline\"><span class=\"orth\">there's no room to swing a cat<\/span><\/span><div class=\"sense\"><span> <\/span><span class=\"cit lang_fr\"><span class=\"quote\">on ne peut pas se tourner<\/span><\/span><\/div><!-- End of DIV sense--><\/span><\/div><!-- End of DIV sense--><div class=\"sense\"><span> <br/><\/span><span class=\"bold\">2 <\/span><span class=\"lbl\"><span>(= <\/span>big cat<span>)<\/span><\/span><span> <\/span><span class=\"cit lang_fr\"><span class=\"quote\">félin <em class=\"hi\">m<\/em><\/span><\/span><span class=\"cit\" id=\"cat_1.9\"><span>;   <\/span><\/span><\/div><!-- End of DIV sense--><\/div><!-- End of DIV hom--><\/div><!-- End of DIV entry lang_en-gb--><\/div><!-- End of DIV 
entry_container-->\n"
}

I need to parse this JSON file, but the value of "entryContent" is an HTML string. Should I transform the structure of my initial JSON file, or parse the HTML string directly? I need some advice.

For now I just have this code:

import json
from pprint import pprint
json_data=open('cat.json')

data = json.load(json_data)
#pprint(data)

print data["dictionaryCode"]    
print data["entryLabel"]    
print data["entryContent"]   

json_data.close()

Finally, from the HTML I need to get the text of the pron span, <span class="pron" type="">ˈkæt</span>; the src value of the source element, <source type="audio/mpeg" src="https://api.collinsdictionary.com/media/sounds/sounds/0/081/08189/08189.mp3"/>; the text of the pos span, <span class="gramGrp"><span class="pos">noun</span></span>; and all the senses provided by the div elements, for example:

<div class="sense">
<span> <br/></span>
<span class="bold">2 </span><span class="lbl"><span>(= </span>big cat<span>)</span></span><span> </span><span class="cit lang_fr"><span class="quote">félin <em class="hi">m</em></span></span><span class="cit" id="cat_1.9"><span>;   </span></span>
</div>

Upvotes: 1

Views: 2841

Answers (1)

Reut Sharabani

Reputation: 31349

Try using BeautifulSoup:

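import json
from bs4 import BeautifulSoup

# Load the JSON inside a 'with' block so the file is closed automatically
# (this replaces the explicit open() / json_data.close() pair)
with open('cat.json') as f:
    data = json.load(f)

# .get() returns None instead of raising KeyError when a key is absent
# (the sample shown has no "dictionaryCode")
print(data.get("dictionaryCode"))
print(data["entryLabel"])

# Parse the HTML string; naming the parser avoids bs4's "no parser specified" warning
entryContentHTML = BeautifulSoup(data["entryContent"], "html.parser")
print(entryContentHTML.prettify())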
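
Once the HTML is parsed, the four pieces the question asks for can be pulled out with find and find_all. The following is only a minimal sketch, assuming Python 3 with bs4 installed; the names extract_entry and sample are hypothetical, and sample is a shortened copy of the question's HTML rather than the full entryContent value.

```python
from bs4 import BeautifulSoup


def extract_entry(html):
    """Return pronunciation, audio URL, part of speech and senses from an entry."""
    soup = BeautifulSoup(html, "html.parser")

    # The pron span also wraps the <audio> fallback text, so take only its
    # first non-empty string instead of calling get_text() on the whole span.
    pron = soup.find("span", class_="pron")
    pron_text = next(pron.stripped_strings, None) if pron else None

    # The MP3 URL is the src attribute of the <source> element.
    source = soup.find("source", attrs={"type": "audio/mpeg"})
    audio_url = source.get("src") if source else None

    # Part of speech, e.g. "noun".
    pos = soup.find("span", class_="pos")
    pos_text = pos.get_text(strip=True) if pos else None

    # One flattened, whitespace-normalised string per <div class="sense">.
    senses = [" ".join(div.get_text().split())
              for div in soup.find_all("div", class_="sense")]

    return {"pron": pron_text, "audio": audio_url,
            "pos": pos_text, "senses": senses}


# Hypothetical, shortened sample standing in for data["entryContent"].
sample = (
    '<div class="entry">'
    '<span class="pron" type="">ˈkæt'
    '<audio type="pronunciation" title="cat">'
    '<source type="audio/mpeg" src="https://api.collinsdictionary.com/'
    'media/sounds/sounds/0/081/08189/08189.mp3"/>'
    'Your browser does not support HTML5 audio.</audio></span>'
    '<span class="gramGrp"><span class="pos">noun</span></span>'
    '<div class="sense"><span class="bold">2 </span>'
    '<span class="lbl"><span>(= </span>big cat<span>)</span></span><span> </span>'
    '<span class="cit lang_fr"><span class="quote">félin '
    '<em class="hi">m</em></span></span></div>'
    '</div>'
)

entry = extract_entry(sample)
print(entry["pron"])
print(entry["audio"])
print(entry["pos"])
print(entry["senses"])
```

With the real file you would call extract_entry(data["entryContent"]). Note that the full entry nests some sense divs inside others, so find_all returns each inner sense both on its own and as part of its parent's text; whether to filter those out depends on how you want the senses grouped.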

Upvotes: 3
