Reputation: 391
I am trying to slice a json file, the file looks like this:
{"price": 17.95, "categories": [["Musical Instruments", "Instrument Accessories", "General Accessories", "Sheet Music Folders"]], "imUrl": "http://ecx.images-amazon.com/images/I/41EpRmh8MEL._SY300_.jpg", "title": "Six Sonatas For Two Flutes Or Violins, Volume 2 (#4-6)", "salesRank": {"Musical Instruments": 207315}, "asin": "0006428320"}
{"description": "Composer: J.S. Bach.Peters Edition.For two violins and pianos.", "related": {"also_viewed": ["B0058DK7RA"], "buy_after_viewing": ["B0058DK7RA"]}, "categories": [["Musical Instruments"]], "brand": "", "imUrl": "http://ecx.images-amazon.com/images/I/41m6ygCqc8L._SY300_.jpg", "title": "Double Concerto in D Minor By Johann Sebastian Bach. Edited By David Oistrach. For Violin I, Violin Ii and Piano Accompaniment. Urtext. Baroque. Medium. Set of Performance Parts. Solo Parts, Piano Reduction and Introductory Text. BWV 1043.", "salesRank": {"Musical Instruments": 94593}, "asin": "0014072149", "price": 18.77}
{"asin": "0041291905", "categories": [["Musical Instruments", "Instrument Accessories", "General Accessories", "Sheet Music Folders"]], "imUrl": "http://ecx.images-amazon.com/images/I/41maAqSO9hL._SY300_.jpg", "title": "Hal Leonard Vivaldi Four Seasons for Piano (Original Italian Text)", "salesRank": {"Musical Instruments": 222972}, "description": "Vivaldi's famous set of four violin concertos certainly ranks among the all-time top ten classical favorites. Features include an introduction about the history of The Four Seasons and Vivaldi's original vivid Italian score markings. A must for classical purists."}
You can see the fields is not arrange strictly in all the lines and i only need part of the fields. so I wrote this code:
import json, csv
infile = open("sample_output.strict", "r")
outfile = open("output.csv", "w")
writer = csv.writer(outfile)
fileds = ["asin","price"]
for product in json.loads(infile.read()):
line = []
for f in fields:
if product.has_key(f):
line.append(product[f])
else:
line.append("")
writer.write(line)
I got below error msg:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-14-3e335b184eea> in <module>()
6
7 fileds = ["asin","price"]
----> 8 for product in json.loads(infile.read()):
9 line = []
10 for f in fields:
C:\Anaconda3\lib\json\__init__.py in loads(s, encoding, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
316 parse_int is None and parse_float is None and
317 parse_constant is None and object_pairs_hook is None and not kw):
--> 318 return _default_decoder.decode(s)
319 if cls is None:
320 cls = JSONDecoder
C:\Anaconda3\lib\json\decoder.py in decode(self, s, _w)
344 end = _w(s, end).end()
345 if end != len(s):
--> 346 raise ValueError(errmsg("Extra data", s, end, len(s)))
347 return obj
348
ValueError: Extra data: line 2 column 1 - line 3 column 617 (char 339 - 1581)
Upvotes: 1
Views: 10759
Reputation: 7616
What you have is lines of json, not a single json document. Change your program to read each line and convert it to json, then look in each document that way. This is actually pretty common, I receive data to load all the time that is in this format.
Doing it line by line will save you a lot on memory if you end up dealing with large files anyhow.
import json, csv
with open("sample_output.strict", "r") as infile:
with open("output.csv", "w") as outfile:
writer = csv.writer(outfile)
fields = ["asin","price"]
for json_line in infile:
product = json.loads(json_line)
line = []
for f in fields:
if product.has_key(f):
line.append(product[f])
else:
line.append("")
writer.writerow(line)
Upvotes: 1
Reputation: 26
Your input json file is ill-formed. That is the reason you are seeing this error. In short, you cannot have multiple JSON "objects" in a single file. However, in your case, there are 3 hashes being seen. One solution for this is to encompass them with a top-level list like this:
[
{"price": 17.95, "categories": [["Musical Instruments", "Instrument Accessories", "General Accessories", "Sheet Music Folders"]], "imUrl": "http://ecx.images-amazon.com/images/I/41EpRmh8MEL._SY300_.jpg", "title": "Six Sonatas For Two Flutes Or Violins, Volume 2 (#4-6)", "salesRank": {"Musical Instruments": 207315}, "asin": "0006428320"},
{"description": "Composer: J.S. Bach.Peters Edition.For two violins and pianos.", "related": {"also_viewed": ["B0058DK7RA"], "buy_after_viewing": ["B0058DK7RA"]}, "categories": [["Musical Instruments"]], "brand": "", "imUrl": "http://ecx.images-amazon.com/images/I/41m6ygCqc8L._SY300_.jpg", "title": "Double Concerto in D Minor By Johann Sebastian Bach. Edited By David Oistrach. For Violin I, Violin Ii and Piano Accompaniment. Urtext. Baroque. Medium. Set of Performance Parts. Solo Parts, Piano Reduction and Introductory Text. BWV 1043.", "salesRank": {"Musical Instruments": 94593}, "asin": "0014072149", "price": 18.77},
{"asin": "0041291905", "categories": [["Musical Instruments", "Instrument Accessories", "General Accessories", "Sheet Music Folders"]], "imUrl": "http://ecx.images-amazon.com/images/I/41maAqSO9hL._SY300_.jpg", "title": "Hal Leonard Vivaldi Four Seasons for Piano (Original Italian Text)", "salesRank": {"Musical Instruments": 222972}, "description": "Vivaldi's famous set of four violin concertos certainly ranks among the all-time top ten classical favorites. Features include an introduction about the history of The Four Seasons and Vivaldi's original vivid Italian score markings. A must for classical purists."}
]
Then you can use the following piece of code to slice:
import json, csv
infile = open("sample_output.strict", "r")
jsondata = json.loads(infile.read())
outfile = open("output.csv", "w")
writer = csv.writer(outfile)
fields = ["asin","price"]
for product in jsondata:
line = []
for f in fields:
if f in product:
line.append(product)
break # I assume you need to print only once per match!?
else:
line.append("")
writer.write(line)
I don't understand what you're trying to do with csv output, so I just copied it as it is, to demonstrate the fix.
Upvotes: 0