Jasmine
Jasmine

Reputation: 391

How to slice a json file, only extract part of the fields

I am trying to slice a json file, the file looks like this:

{"price": 17.95, "categories": [["Musical Instruments", "Instrument Accessories", "General Accessories", "Sheet Music Folders"]], "imUrl": "http://ecx.images-amazon.com/images/I/41EpRmh8MEL._SY300_.jpg", "title": "Six Sonatas For Two Flutes Or Violins, Volume 2 (#4-6)", "salesRank": {"Musical Instruments": 207315}, "asin": "0006428320"}
{"description": "Composer: J.S. Bach.Peters Edition.For two violins and pianos.", "related": {"also_viewed": ["B0058DK7RA"], "buy_after_viewing": ["B0058DK7RA"]}, "categories": [["Musical Instruments"]], "brand": "", "imUrl": "http://ecx.images-amazon.com/images/I/41m6ygCqc8L._SY300_.jpg", "title": "Double Concerto in D Minor By Johann Sebastian Bach. Edited By David Oistrach. For Violin I, Violin Ii and Piano Accompaniment. Urtext. Baroque. Medium. Set of Performance Parts. Solo Parts, Piano Reduction and Introductory Text. BWV 1043.", "salesRank": {"Musical Instruments": 94593}, "asin": "0014072149", "price": 18.77}
{"asin": "0041291905", "categories": [["Musical Instruments", "Instrument Accessories", "General Accessories", "Sheet Music Folders"]], "imUrl": "http://ecx.images-amazon.com/images/I/41maAqSO9hL._SY300_.jpg", "title": "Hal Leonard Vivaldi Four Seasons for Piano (Original Italian Text)", "salesRank": {"Musical Instruments": 222972}, "description": "Vivaldi's famous set of four violin concertos certainly ranks among the all-time top ten classical favorites. Features include an introduction about the history of The Four Seasons and Vivaldi's original vivid Italian score markings. A must for classical purists."}

You can see the fields is not arrange strictly in all the lines and i only need part of the fields. so I wrote this code:

import json, csv

infile = open("sample_output.strict", "r")
outfile = open("output.csv", "w")
writer = csv.writer(outfile)

fileds = ["asin","price"]    
for product in json.loads(infile.read()):
    line = []
    for f in fields:
        if product.has_key(f):
            line.append(product[f])
        else:
            line.append("")
        writer.write(line)

I got below error msg:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-14-3e335b184eea> in <module>()
      6 
      7 fileds = ["asin","price"]
----> 8 for product in json.loads(infile.read()):
      9     line = []
     10     for f in fields:

C:\Anaconda3\lib\json\__init__.py in loads(s, encoding, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
    316             parse_int is None and parse_float is None and
    317             parse_constant is None and object_pairs_hook is None and not kw):
--> 318         return _default_decoder.decode(s)
    319     if cls is None:
    320         cls = JSONDecoder

C:\Anaconda3\lib\json\decoder.py in decode(self, s, _w)
    344         end = _w(s, end).end()
    345         if end != len(s):
--> 346             raise ValueError(errmsg("Extra data", s, end, len(s)))
    347         return obj
    348 

ValueError: Extra data: line 2 column 1 - line 3 column 617 (char 339 - 1581) 

Upvotes: 1

Views: 10759

Answers (2)

woot
woot

Reputation: 7616

What you have is lines of json, not a single json document. Change your program to read each line and convert it to json, then look in each document that way. This is actually pretty common, I receive data to load all the time that is in this format.

Doing it line by line will save you a lot on memory if you end up dealing with large files anyhow.

import json, csv

with open("sample_output.strict", "r") as infile:
    with open("output.csv", "w") as outfile:
        writer = csv.writer(outfile)

        fields = ["asin","price"]  
        for json_line in infile:  
            product = json.loads(json_line)
            line = []
            for f in fields:
                if product.has_key(f):
                    line.append(product[f])
                else:
                    line.append("")
            writer.writerow(line)

Upvotes: 1

Thejaswi
Thejaswi

Reputation: 26

Your input json file is ill-formed. That is the reason you are seeing this error. In short, you cannot have multiple JSON "objects" in a single file. However, in your case, there are 3 hashes being seen. One solution for this is to encompass them with a top-level list like this:

[
    {"price": 17.95, "categories": [["Musical Instruments", "Instrument Accessories", "General Accessories", "Sheet Music Folders"]], "imUrl": "http://ecx.images-amazon.com/images/I/41EpRmh8MEL._SY300_.jpg", "title": "Six Sonatas For Two Flutes Or Violins, Volume 2 (#4-6)", "salesRank": {"Musical Instruments": 207315}, "asin": "0006428320"},
    {"description": "Composer: J.S. Bach.Peters Edition.For two violins and pianos.", "related": {"also_viewed": ["B0058DK7RA"], "buy_after_viewing": ["B0058DK7RA"]}, "categories": [["Musical Instruments"]], "brand": "", "imUrl": "http://ecx.images-amazon.com/images/I/41m6ygCqc8L._SY300_.jpg", "title": "Double Concerto in D Minor By Johann Sebastian Bach. Edited By David Oistrach. For Violin I, Violin Ii and Piano Accompaniment. Urtext. Baroque. Medium. Set of Performance Parts. Solo Parts, Piano Reduction and Introductory Text. BWV 1043.", "salesRank": {"Musical Instruments": 94593}, "asin": "0014072149", "price": 18.77},
    {"asin": "0041291905", "categories": [["Musical Instruments", "Instrument Accessories", "General Accessories", "Sheet Music Folders"]], "imUrl": "http://ecx.images-amazon.com/images/I/41maAqSO9hL._SY300_.jpg", "title": "Hal Leonard Vivaldi Four Seasons for Piano (Original Italian Text)", "salesRank": {"Musical Instruments": 222972}, "description": "Vivaldi's famous set of four violin concertos certainly ranks among the all-time top ten classical favorites. Features include an introduction about the history of The Four Seasons and Vivaldi's original vivid Italian score markings. A must for classical purists."}
]

Then you can use the following piece of code to slice:

import json, csv
infile = open("sample_output.strict", "r")
jsondata = json.loads(infile.read())
outfile = open("output.csv", "w")
writer = csv.writer(outfile)
fields = ["asin","price"]
for product in jsondata:
    line = []
    for f in fields:
        if f in product:
            line.append(product)
            break   # I assume you need to print only once per match!?
        else:
            line.append("")
    writer.write(line)

I don't understand what you're trying to do with csv output, so I just copied it as it is, to demonstrate the fix.

Upvotes: 0

Related Questions