Reputation: 557

Parsing multiple json objects from a text file using Python

I have a .json file where each line is an object. For example, the first two lines are:

{"review_id":"x7mDIiDB3jEiPGPHOmDzyw","user_id":"msQe1u7Z_XuqjGoqhB0J5g","business_id": ...}

{"review_id":"dDl8zu1vWPdKGihJrwQbpw","user_id":"msQe1u7Z_XuqjGoqhB0J5g","business_id": ...}

I have tried processing using ijson lib as follows:

with open(filename, 'r') as f:
    objects = ijson.items(f, 'columns.items')
    columns = list(objects)

However, I get this error:

JSONError: Additional data

It seems I'm getting this error because the file contains multiple objects.

What's the recommended way to analyze such a JSON file in Jupyter?

Thank you in advance.

Upvotes: 3

Answers (3)

A-y

Reputation: 793

While each line on its own is valid JSON, your file as a whole is not. As such, you can't parse it in one go; you have to iterate over the lines and parse each one into an object.

You can collect these objects in one list and from there do whatever you like with your data:

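import json

with open(filename, 'r') as f:
    object_list = []
    for line in f:  # iterate over the file directly instead of f.readlines()
        if line.strip():  # skip blank lines, which json.loads would reject
            object_list.append(json.loads(line))
# object_list now contains all of your file's data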

You could write it as a list comprehension to make it a little more Pythonic:

with open(filename, 'r') as f:
    object_list = [json.loads(line) for line in f if line.strip()]
# object_list now contains all of your file's data
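For a large file, it may be preferable not to build the whole list before processing. A minimal sketch, assuming one JSON object per line with possible blank lines in between; the helper name `iter_json_lines` is hypothetical and not part of any library:

```python
import json


def iter_json_lines(path):
    """Yield one parsed object per non-blank line of the file at `path`."""
    with open(path, "r", encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between objects
            try:
                yield json.loads(line)
            except json.JSONDecodeError as exc:
                # Re-raise with the line number so a bad record is easy to locate.
                raise ValueError(f"line {line_number}: {exc}") from exc
```

`list(iter_json_lines(filename))` builds the same list as above, while `for obj in iter_json_lines(filename):` handles one object at a time.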

Upvotes: 2

WurzelseppQX

Reputation: 540

If this is the complete file, it is not a single valid JSON document. To be one, the objects must be separated by commas and the whole file must be wrapped in square brackets, like so: [{...},{...}]. For your data it would look like:

[{"review_id":"x7mDIiDB3jEiPGPHOmDzyw","user_id":"msQe1u7Z_XuqjGoqhB0J5g","business_id": ...},
{"review_id":"dDl8zu1vWPdKGihJrwQbpw","user_id":"msQe1u7Z_XuqjGoqhB0J5g","business_id": ...}]

Here is some code to clean your file:

with open("yourfile.json", "r") as f, open("cleanfile.json", "w") as g:
    g.write("[")
    first = True
    for line in f:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if not first:
            g.write(",\n")  # separator goes before every object except the first
        g.write(line)
        first = False
    g.write("]")
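An alternative sketch goes through the json module instead of string concatenation, so every line is validated while converting. The function name `jsonl_to_array` is a placeholder, and the sketch assumes the whole file fits in memory:

```python
import json


def jsonl_to_array(src_path, dst_path):
    """Rewrite a one-object-per-line file as a single JSON array."""
    with open(src_path, "r", encoding="utf-8") as src:
        # json.loads raises json.JSONDecodeError on a malformed line.
        objects = [json.loads(line) for line in src if line.strip()]
    with open(dst_path, "w", encoding="utf-8") as dst:
        json.dump(objects, dst)
    return len(objects)
```

Calling `jsonl_to_array("yourfile.json", "cleanfile.json")` returns the number of objects written, and the resulting file can be read back with `json.load`.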

To read a JSON file properly, you could also consider using the pandas library (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_json.html).

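import pandas as pd

# get a pandas DataFrame from a JSON file that holds a single JSON array
df = pd.read_json("path/to/your/filename.json")
# for a file with one object per line, pass lines=True instead:
# df = pd.read_json("path/to/your/filename.json", lines=True)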

If you are not familiar with pandas, here is a quick head start on working with a DataFrame object:

df.head()        # gives you the first five rows of the DataFrame
df["review_id"]  # gives you the column review_id as a Series
df.iloc[1, :]    # gives you the complete row at position 1
df.iloc[1, 2]    # gives you the item at row position 1, column position 2

Upvotes: 3

C.Nivs

Reputation: 13106

Your file contains multiple JSON objects, one per line, which is why parsing it as a single document throws an error. Parse each line separately:

import json

with open(filename, 'r') as f:
    lines = f.readlines()
    first = json.loads(lines[0])
    second = json.loads(lines[1])

That should pick up both lines and load them properly.
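To see why the whole file fails while individual lines succeed, here is a small sketch using only the standard json module, with two made-up records standing in for the file's contents. Where ijson reports "Additional data", the standard library reports the same situation as "Extra data":

```python
import json

# Two made-up records, one per line, standing in for the file's contents.
text = '{"review_id": "a"}\n{"review_id": "b"}\n'

try:
    json.loads(text)  # the whole text is not a single JSON document
    error_message = None
except json.JSONDecodeError as exc:
    error_message = exc.msg  # "Extra data"

# Parsing line by line works for any number of lines, not just the first two.
records = [json.loads(line) for line in text.splitlines() if line.strip()]
```

The same loop over `lines` from the snippet above would load every object in the file rather than only the first two.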

Upvotes: 1
