Ankit Gadodia

Reputation: 7

Converting JSON data to a dataframe or CSV

I am trying to convert a JSON file into a table format, where the first row contains the distinct fields and the rest of the rows contain the corresponding data for those fields.

After converting my JSON to text, this is what the data looks like (just a small sample to add some context):

{"business_id": "O_X3PGhk3Y5JWVi866qlJg", "full_address": "1501 W Bell Rd\nPhoenix, AZ 85023", "hours": {"Monday": {"close": "18:00", "open": "11:00"}, "Tuesday": {"close": "18:00", "open": "11:00"}, "Friday": {"close": "18:00", "open": "11:00"}, "Wednesday": {"close": "18:00", "open": "11:00"}, "Thursday": {"close": "18:00", "open": "11:00"}, "Sunday": {"close": "18:00", "open": "11:00"}, "Saturday": {"close": "18:00", "open": "11:00"}}, "open": true, "categories": ["Active Life", "Arts & Entertainment", "Stadiums & Arenas", "Horse Racing"], "city": "Phoenix", "review_count": 29, "name": "Turf Paradise Race Course", "neighborhoods": [], "longitude": -112.0923293, "state": "AZ", "stars": 4.0, "latitude": 33.638572699999997, "attributes": {"Take-out": false, "Wi-Fi": "free", "Good For": {"dessert": false, "latenight": false, "lunch": false, "dinner": false, "brunch": false, "breakfast": false}, "Noise Level": "average", "Takes Reservations": true, "Has TV": true, "Delivery": false, "Ambience": {"romantic": false, "intimate": false, "touristy": false, "hipster": false, "divey": false, "classy": false, "trendy": false, "upscale": false, "casual": false}, "Parking": {"garage": false, "street": false, "validated": false, "lot": true, "valet": true}, "Wheelchair Accessible": true, "Outdoor Seating": true, "Attire": "casual", "Alcohol": "full_bar", "Waiter Service": true, "Accepts Credit Cards": true, "Good for Kids": false, "Good For Groups": true, "Price Range": 2}, "type": "business"}

I am having some trouble with the initial code:

import json
data=json.load(open('yelp_academic_dataset_user.json'))

I get the following error.

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-20-fbf46968052d> in <module>()
      1 
      2 import json
----> 3 data=json.load(open('yelp_academic_dataset_user.json'))

C:\Users\ankit.gadodia\AppData\Local\Continuum\Anaconda\lib\json\__init__.pyc in load(fp, encoding, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
    288         parse_float=parse_float, parse_int=parse_int,
    289         parse_constant=parse_constant, object_pairs_hook=object_pairs_hook,
--> 290         **kw)
    291 
    292 

C:\Users\ankit.gadodia\AppData\Local\Continuum\Anaconda\lib\json\__init__.pyc in loads(s, encoding, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
    336             parse_int is None and parse_float is None and
    337             parse_constant is None and object_pairs_hook is None and not kw):
--> 338         return _default_decoder.decode(s)
    339     if cls is None:
    340         cls = JSONDecoder

C:\Users\ankit.gadodia\AppData\Local\Continuum\Anaconda\lib\json\decoder.pyc in decode(self, s, _w)
    367         end = _w(s, end).end()
    368         if end != len(s):
--> 369             raise ValueError(errmsg("Extra data", s, end, len(s)))
    370         return obj
    371 

ValueError: Extra data: line 2 column 1 - line 70818 column 1 (char 259 - 26982351)

Could someone explain how I can rectify this, or suggest some other way to start? All the other examples I have seen so far involve the json.load function. I even tried the file.read() function, but that is giving an error too.

Upvotes: 1

Views: 571

Answers (1)

abarnert

Reputation: 365817

There are two possible problems here.


First, in the actual data you've posted, assuming the \n in your output is actually a newline character, rather than a backslash followed by an n, that isn't valid JSON, and you've probably corrupted it somewhere between output and input.
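
To see the difference concretely, compare how the parser treats the two cases:

import json

# the two-character escape sequence \n is valid JSON...
json.loads('{"full_address": "1501 W Bell Rd\\nPhoenix, AZ 85023"}')

# ...but a raw newline character inside a string is not, and raises ValueError
json.loads('{"full_address": "1501 W Bell Rd\nPhoenix, AZ 85023"}')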

If you can get back to the original data, that would be best.

If you can't, you can try escaping the control characters, e.g., by encoding to unicode-escape. For your existing example, and a lot of real-life data, that will recover the original JSON, but there are edge cases where it won't.

The easiest way to do this, if your file is small enough to load into memory, is:

with open('yelp_academic_dataset_user.json') as f:
    # unicode-escape turns raw control characters such as newlines into \n escapes
    contents = f.read().encode('unicode-escape').decode('ascii')
data = json.loads(contents)

If your files get huge, see the codecs module for how to chain the encoder and decoder in front of the file so you can convert on the fly.
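
Something like this shows the chaining (a rough, untested sketch; it has the same edge cases as the version above, and for truly huge files you would still want to read in chunks rather than with a single json.load):

import codecs
import json

with open('yelp_academic_dataset_user.json', 'rb') as raw:
    # the wrapper decodes the raw bytes as latin-1 and re-encodes them with
    # unicode-escape as they are read, so control characters come out escaped
    escaped = codecs.EncodedFile(raw, 'unicode-escape', 'latin-1')
    data = json.load(escaped)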


Second, the "line 2 column 1" sounds suspiciously like you have a stream of JSON objects in the file, rather than a single JSON object. This problem isn't visible in the sample you showed us, but maybe some of your files have only one JSON object while others have two or more.

Again, if you can fix the data (e.g., make it a single JSON array, instead of a stream of separate objects), that's always the best solution.

If not, you can't parse that with json.load, because a stream of JSON objects is not a JSON object. What you need to do is to use raw_decode to pull off JSON objects until you're done:

with open('yelp_academic_dataset_user.json') as f:
    contents = f.read()
decoder = json.JSONDecoder()
while contents:
    # raw_decode parses a single JSON value and tells you where it stopped
    data, idx = decoder.raw_decode(contents)
    # drop the parsed object plus any whitespace before the next one
    contents = contents[idx:].lstrip()

Of course in this case, you'll have to use each data inside the while loop, or accumulate the objects into some collection.

In the case where you know that each JSON object will be on a line by itself, you can do something simpler (and more efficient):

with open('yelp_academic_dataset_user.json') as f:
    for line in f:
        data = json.loads(line)

But since it's perfectly legal to have newlines in the middle of JSON objects (just not inside strings), this may be pretty brittle.
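
Once you have the objects parsed, getting to the table/CSV you originally asked about is mostly a pandas question. Here is a sketch, assuming pandas is available and the one-object-per-line layout above (the output filename is just a placeholder):

import json
import pandas as pd

with open('yelp_academic_dataset_user.json') as f:
    records = [json.loads(line) for line in f]

# json_normalize flattens the nested dicts ("hours", "attributes", ...) into
# dotted column names such as "hours.Monday.open"; in older pandas it lives
# at pandas.io.json.json_normalize instead of pd.json_normalize
df = pd.json_normalize(records)
df.to_csv('yelp_academic_dataset_user.csv', index=False)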

Upvotes: 2
