Reputation: 1
I have a JSON file containing several dictionaries, each with lots of information about a specific website. I would like to write a program that iterates through the dictionaries and outputs only the HTML found in each one, which is located at data["p80"]["http"]["get"]["body"].
Below is an example of two of the dictionaries in the JSON file.
{"p80":{"http":{"get":{"body": "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\"\n\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">\n<html xmlns=\"http://www.w3.org/1999/xhtml\">\n\t<head>\n\t\t<title>Motormax</title>\n <meta name=viewport content=\"width=device-width, initial-scale=1.0\" />\r\n<meta name=\"google-site-verification\" content=\"wqSGgrJPlLskInflNQPXn9oY25etuJYuRQonZ0k0I_o\" />\r\n<link href='https://fonts.googleapis.com/css?family=Lato:400,700,900' rel='stylesheet' type='text/css'>\r\n \t\t<meta name=\"description\" content=\"\" /> \n\t\t<meta name=\"keywords\" content=\"Motormaax, Renault, Chevrolet, Nissan, Peugeot, Volkswagen, Ford, Planes de ahorro, financiaci\u00f3n, cuotas, autos en cuotas\" /> \n\t\t<meta http-equiv=\"Content-type\" content=\"text/html; charset=UTF-8\" />\n\t\t\n <script src=\"/processedjs/kms427.js\" type=\"text/javascript\"></script> <link rel=\"stylesheet\" type=\"text/css\" href=\"/processedcss/kms427.css\" />\n\t\t\n\t\t<script type=\"text/javascript\">\n\t\t\tvar dataLayer = [];\n\t\t</script>\n <script type=\"text/javascript\">(function(w,d,s,l,i){w[l]=w[l]||[];w[l].push({'gtm.start':\r\nnew Date().getTime(),event:'gtm.js'});var f=d.getElementsByTagName(s)[0],\r\nj=d.createElement(s),dl=l!='dataLayer'?'&l='+l:'';j.async=true;j.src=\r\n'//www.googletagmanager.com/gtm.js?id='+i+dl;f.parentNode.insertBefore(j,f);\r\n})(window,document,'script','dataLayer','GTM-582XL3');</script>\n\t\t\n\t\t\n\n\n\t</head>\n\t<body>\n\t<div style=\"visibility: hidden; display: none;\"></div>\r\n<div class=\"main\">\r\n\t\t\t<p><img src=\"/templatepagina/template_246/images/logo_motormax.png\" alt=\"Motormax\" /></p>\r\n\t\t\t<h1>TE ACOMPA\u00d1AMOS EN LA COMPRA DE TU <b>NUEVO AUTO</b></h1>\r\n\t\t\t<p id=\"line\"></p>\r\n\t\t\t\r\n\t\r\n<ul class=\"marcas\">\t\t\t\r\n<a href=\"/peugeot\"><li id=\"peugeot\"><p>Peugeot</p></li></a>\r\n\t\t\t\t<a href=\"/fiat\"><li id=\"fiat\"><p>fiat</p></li></a>\r\n\t\t\t\t<a 
href=\"/ford\"><li id=\"ford\"><p>ford</p></li></a>\r\n\t\t\t\t<a href=\"/renault\"><li id=\"renault\"><p>renault</p></li></a>\r\n <a href=\"/volkswagen\"><li id=\"vw\"><p>vw</p></li></a>\r\n\t\t\t\r\n\t\r\n\t\t\t\t<!-- <li id=\"nissan\"><p>nissan</p></li> -->\r\n\t\t\t</ul>\r\n\t\t</div>\t\r\n</body>\n</html>", "body_sha256": "fEHZCw9VEdmwVabOd0g8TntigYiA9AsL+sKicdipejU=", "headers": {"cache_control": "post-check=0, pre-check=0", "content_length": "2118", "content_type": "text/html; charset=UTF-8", "expires": "Thu, 19 Nov 1981 08:52:00 GMT", "pragma": "no-cache", "server": "Apache/2.4.6 (CentOS) OpenSSL/1.0.1e-fips PHP/5.4.16", "unknown": [{"key": "date", "value": "Mon, 07 Nov 2016 16:36:25 GMT"}], "x_powered_by": "PHP/5.4.16"}, "metadata": {"description": "Apache httpd 2.4.6", "manufacturer": "Apache", "product": "httpd", "version": "2.4.6"}, "status_code": 200, "status_line": "200 OK", "title": "Motormax", "timestamp":"2016-11-09 12:28:36"}}}}
{"p80":{"http":{"get":{"body": " \n<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.01 Transitional//EN\"\n\"http://www.w3.org/TR/html4/loose.dtd\">\n<html>\n<head>\n<title>Kody pocztowe - wyszukiwarka</title>\n<META HTTP-EQUIV=\"Content-Type\" CONTENT=\"text/html; charset=iso-8859-2\">\n<META NAME=\"Keywords\" CONTENT=\"kody pocztowe, kod pocztowy, Poczta Polska, przesy\ufffdki, listy\">\n<META NAME=\"Description\" CONTENT=\"Na tej stronie mo\ufffdesz wyszuka\ufffd kody pocztowe dowolnych miejscowo\ufffdci w Polsce. Podaj miasto, ulic\ufffd i znajd\ufffd potrzebny Ci kod pocztowy. Jest on niezb\ufffddny, je\ufffdli list lub inna przesy\ufffdka ma dotrze\ufffd do adresata na terenie Polski.\">\n<META HTTP-EQUIV=\"Content-Language\" CONTENT=\"PL\">\n<META NAME=\"distribution\" CONTENT=\"Global\">\n<META NAME=\"revisit-after\" CONTENT=\"2 days\">\n<META NAME=\"robots\" CONTENT=\"INDEX,FOLLOW\">\n<style type=\"text/css\">body, td {\nfont-family:arial;\nfont-size:12px;\nmargin:10px 0 10px 0;\ncolor:#000000;\n}\n\n.row { padding: 4px 10px 4px 0; text-align:left}\ninput { }\nimg { border:0;}\n.thead {\ncolor:#FFFFFF; font-size:10px;\nbackground-image:url(http://00-000.pl/gfx/lay/box_top_bg.gif);\npadding:0;\n}\n.pltd{\npadding-right:40px;\ntext-align:right;\nbackground-image:url(http://00-000.pl/gfx/lay/box_bg.gif);\n\ncolor:#000000;\nfont-family:arial;\nfont-size:13px;\nfont-weight:bold;\n}\n.zera{\ncolor:#f26624;\nfont-family:arial;\nfont-size:30px;\n}\n.zeras{\ncolor:#f26624;\nfont-family:arial;\nfont-size:20px;\n}\n.top_right{\nbackground-image:url(http://00-000.pl/gfx/top_bg.gif);\ntext-align:right;\nwidth:auto;\ncolor:#FFFFFF; font-weight:bold; padding-right:20px;}\n.top_bar{\nbackground-color:#eeeeee;\npadding:0 0px 0 8px;\nfont-size:10px;\n\n}\n\na:link{\ntext-decoration:underline;\ncolor:#000000;\n}\na:visited{\ntext-decoration:underline;\ncolor:#000000;\n}\na:hover{ 
color:#FF0000;\ntext-decoration:none;\n}\na:link.white{\ncolor:#ffffff;\ntext-decoration:none;\n\n}\na:visited.white{\ncolor:#ffffff;\ntext-decoration:none;\n\n}\na:hover.white{ color:#FF3300;\ntext-decoration:underline;\n\n}\n\na:link.head{\ncolor:#ffffff;\ntext-decoration:none;\nfont-weight:bold;\n}\na:visited.head{\ncolor:#ffffff;\ntext-decoration:none;\nfont-weight:bold;\n}\na:hover.head{ color:#FFFF00;\ntext-decoration:underline;\nfont-weight:bold;\n}\n\nli {\nlist-style-type:square;\nlist-style-position:inside;\n}\nh1{\nfont-family:arial;\nfont-size:25px;\nmargin:0 0 5px 0;\n}\nh3{\nfont-size:15px;\ncolor:#993300;\nmargin:0 0 10px 0;\npadding:0;\n\n}\na:link.linkbox{\ncolor:#009900;\ntext-decoration:none;\n}\na:visited.linkbox{\ncolor:#009900;\ntext-decoration:none;\n}\na:hover.linkbox{\ncolor:#009900;\ntext-decoration:underline;\n}\n\n\n.top_box_orange {\nbackground-image:url(http://00-000.pl/gfx/lay/box_top_bg_orange.gif);\nborder-bottom:1px solid #ffffff; \nfont-weight:bold; padding-left:9px;\nheight:21px;\ncolor:#FFFFFF;\n}\n.top_box_grey {\nbackground-image:url(http://00-000.pl/gfx/lay/box_top_bg_grey.gif);\nborder-bottom:1px solid #ffffff; \nfont-weight:bold; height:21px; padding-left:9px;\ncolor:#FFFFFF;\n}\n.top_box_grey_k {background-color:#999999;\nborder-bottom:1px solid #ffffff; \nfont-weight:bold; height:21px; padding-left:9px;\ncolor:#FFFFFF;\n}\n\n.box{\nbackground-image:url(http://00-000.pl/gfx/lay/box_bg.gif);\npadding:15px 10px 20px 10px;\nline-height:15px\n}\n\n.form_ok {\nmargin:10px 0 10px 0;\nbackground-color:#FFFFCC;\ncolor:#99CC00;\nfont-size:14px;\nfont-weight:bold;\npadding:20px;\ntext-align:left;\nborder: 1px solid #009900;\n}\n.form_bad {\nmargin:10px 0 10px 0;\nbackground-color:#FFFFCC;\ncolor:#CC0000;\nfont-size:14px;\nfont-weight:bold;\npadding:20px;\ntext-align:left;\nborder: 1px solid #990000;\n}\n\na.button {\ndisplay:block;\nbackground-color:#f26623;\ncolor:#fff;\npadding:5px 10px;\n width:150px;\nmargin:0 10px 0 
10px;\nfloat:right;\ntext-align:center;\ntext-decoration:none;\n}\na:visited.button { color:#fff;}\na:hover.button {\ntext-decoration:underline;\ncolor:#000;\n\n}\n</style>\n</head>\n<body>\n\n<table cellpadding=\"0\" cellspacing=\"0\" width=\"80%\" align=\"center\" >\n<tr><td align=\"left\" width=\"190\" colspan=\"2\"><a href=\"http://00-000.pl\"><img src=\"http://00-000.pl/gfx/logo.gif\" border=\"0\" width=\"190\" height=\"70\"></a></td>\n<td width=\"100%\" class=\"top_right\" colspan=\"2\">wyszukiwarka kod\ufffdw pocztowych</Td>\n<td width=\"4\"><img src=\"http://00-000.pl/gfx/top_right.gif\" border=\"0\" width=\"4\" height=\"70\"></td>\n</tr>\n\n<tr>\n<td width=\"4\"><img src=\"http://00-000.pl/gfx/lay/top_bar_left.gif\" border=\"0\" width=\"4\" height=\"21\"></td>\n<Td width=\"186\" class=\"top_bar\">Ostatnia aktualizacja: ", "body_sha256": "/OYNeyTKqqDQNpmG1rmKfK8OYAKfUDP1l8jGUnVlyR8="}}}}
Here's my code so far.
import json
from pprint import pprint
import sys
if __name__ == "__main__":
    file = open('sample101.json', 'r')
    for dict in file:
        for key, value in file.items():
            pprint(file["p80"]["http"]["get"]["body"])
    file.close()
Any help would be greatly appreciated as I am new to Python. Thank you so much!
Upvotes: 0
Views: 1069
Reputation: 77407
If I've got this right, you have a JSON file that holds a list of dictionaries and you want to extract the HTML from each one. In that case, you need to parse the entire file as JSON, after which the extraction is simple. Don't name a variable dict, because that masks the built-in dict class; otherwise this should do.
import json
from pprint import pprint

if __name__ == "__main__":
    # Parse the whole file as a single JSON document (a list of dictionaries).
    with open('sample101.json', encoding='utf-8') as f:
        for data_dict in json.load(f):
            pprint(data_dict["p80"]["http"]["get"]["body"])
If you are worried about bad data, you could wrap this all in a try/except block and grab the items one at a time.
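for data_dict in json.load(open('sample101.json', encoding='utf-8')):
    # Walk down one key at a time so a failure reports exactly where it happened.
    for key in "p80", "http", "get", "body":
        try:
            data_dict = data_dict[key]
        except (TypeError, KeyError):
            print("Error at", key)
            print(repr(data_dict))
            raise  # or replace with break to move on to the next item
    else:
        pprint(data_dict)  # every key was found, so this is the HTML body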
UPDATE
Suppose it's not a JSON file but a file with one JSON string per line. Then we rework the loop a bit (and stop calling it xxx.json!).
for line in open('sample101.json', encoding='utf-8'):
    data_dict = json.loads(line)  # each line is its own JSON document
    for key in "p80", "http", "get", "body":
        try:
            data_dict = data_dict[key]
        except (TypeError, KeyError):
            print("Error at", key)
            print(repr(data_dict))
            raise  # or replace with break to move on to the next line
    else:
        pprint(data_dict)  # every key was found, so this is the HTML body
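For completeness, here is a minimal self-contained sketch of the one-JSON-object-per-line approach that skips bad records instead of raising. The names iter_bodies and KEY_PATH are made up for illustration, and the sample lines are invented stand-ins for the real data; only the key path comes from the question.

```python
import json

# Key path taken from the question; the name KEY_PATH is just for illustration.
KEY_PATH = ("p80", "http", "get", "body")


def iter_bodies(lines):
    """Yield the HTML body from each JSON line, skipping records without one."""
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        value = json.loads(line)
        try:
            for key in KEY_PATH:
                value = value[key]
        except (TypeError, KeyError):
            print("No body on line", line_number)
            continue  # move on to the next line
        yield value


if __name__ == "__main__":
    # Invented sample records standing in for the lines of the real file.
    sample = [
        '{"p80": {"http": {"get": {"body": "<html>one</html>"}}}}',
        '',
        '{"p80": {"http": {}}}',
        '{"p80": {"http": {"get": {"body": "<html>two</html>"}}}}',
    ]
    for body in iter_bodies(sample):
        print(body)
```

Since an open text file iterates line by line, the same function should work on the real data by passing it the file object, for example inside `with open('sample101.json', encoding='utf-8') as f:`.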
Upvotes: 0
Reputation: 12168
json.load(fp, *, cls=None, object_hook=None, parse_float=None, parse_int=None, parse_constant=None, object_pairs_hook=None, **kw)
Deserialize fp (a .read()-supporting file-like object containing a JSON document) to a Python object using this conversion table.
import json

with open('sample101.json', 'r') as file:
    py_dict = json.load(file)
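One caveat worth checking: json.load expects the file to hold exactly one JSON document. If the file actually has one object per line, as the sample in the question appears to, json.load should fail with an "Extra data" error. The sketch below illustrates the difference using in-memory io.StringIO objects with invented contents in place of the real file.

```python
import io
import json

# A single JSON document (a list of objects): json.load handles this directly.
single_document = io.StringIO('[{"a": 1}, {"a": 2}]')
print(json.load(single_document))

# One JSON object per line: not a single document, so json.load rejects it.
one_object_per_line = io.StringIO('{"a": 1}\n{"a": 2}\n')
try:
    json.load(one_object_per_line)
    multi_ok = True
except json.JSONDecodeError as error:
    multi_ok = False
    print("json.load failed:", error)

# Parsing line by line with json.loads works for that layout.
one_object_per_line.seek(0)
records = [json.loads(line) for line in one_object_per_line if line.strip()]
print(records)
```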
Upvotes: 1