Henry Dashwood

Reputation: 313

Reading JSON objects separated by newlines in Python

I am trying to read some JSON in the following format. A plain pd.read_json() raises ValueError: Trailing data. Adding lines=True raises ValueError: Expected object or value. I've tried various combinations of readlines() and load()/loads(), so far without success.

Any ideas how I could get this into a dataframe?

{
    "content": "kdjfsfkjlffsdkj",
    "source": {
        "name": "jfkldsjf"
    },
    "title": "dsldkjfslj",
    "url": "vkljfklgjkdlgj"
}

{
    "content": "djlskgfdklgjkfgj",
    "source": {
        "name": "ldfjkdfjs"
    },
    "title": "lfsjdfklfldsjf",
    "url": "lkjlfggdflkjgdlf"
}

Upvotes: 0

Views: 2902

Answers (4)

win_wave

Reputation: 1508

If you can use jq, the solution is simpler. The -s (--slurp) flag reads all the input objects into a single array:

jq -s '.' path/to/original.json > path/to/reformatted.json

Upvotes: 0

SarahJessica

Reputation: 524

The sample above isn't valid JSON as a single document. To be valid, the objects need to be inside a JSON array ([]) and separated by commas, as follows:

[{
    "content": "kdjfsfkjlffsdkj",
    "source": {
        "name": "jfkldsjf"
    },
    "title": "dsldkjfslj",
    "url": "vkljfklgjkdlgj"
},

{
    "content": "djlskgfdklgjkfgj",
    "source": {
        "name": "ldfjkdfjs"
    },
    "title": "lfsjdfklfldsjf",
    "url": "lkjlfggdflkjgdlf"
}]

I just tried this on my machine. When the file is formatted correctly, it works:

>>> pd.read_json('data.json')
            content                 source           title               url
0   kdjfsfkjlffsdkj   {'name': 'jfkldsjf'}      dsldkjfslj    vkljfklgjkdlgj
1  djlskgfdklgjkfgj  {'name': 'ldfjkdfjs'}  lfsjdfklfldsjf  lkjlfggdflkjgdlf
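
Note that the source column in the output above still holds nested dicts. If flat columns are preferred, pandas' json_normalize may help. The following is a minimal sketch, assuming the records are already parsed into a list of Python dicts; the names records and flat are only illustrative and not part of the original question.

```python
import pandas as pd

# The two records from the question, already parsed into Python dicts.
records = [
    {
        "content": "kdjfsfkjlffsdkj",
        "source": {"name": "jfkldsjf"},
        "title": "dsldkjfslj",
        "url": "vkljfklgjkdlgj",
    },
    {
        "content": "djlskgfdklgjkfgj",
        "source": {"name": "ldfjkdfjs"},
        "title": "lfsjdfklfldsjf",
        "url": "lkjlfggdflkjgdlf",
    },
]

# json_normalize expands nested dicts into dotted column names,
# so the nested "source" dict becomes a "source.name" column.
flat = pd.json_normalize(records)
print(flat.columns.tolist())
```

This should work with pandas 1.0 or later, where json_normalize is available at the top level of the package.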

Upvotes: 3

Henry Dashwood

Reputation: 313

Thanks for the ideas, internet. None quite solved the problem in the way I needed (I had lots of newline characters in the strings themselves, which meant I couldn't split on them), but they helped point the way. In case anyone has a similar problem, this is what worked for me:

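import json
import pandas as pd

with open('path/to/original.json', 'r') as f:
    data = f.read()

# Split after every "}" that ends a line; in this data only top-level objects end that way.
data = data.split("}\n")
# Restore the closing brace that split() consumed.
data = [d.strip() + "}" for d in data]
# Drop the lone "}" produced by empty chunks (e.g. after the last object).
data = [d for d in data if d != "}"]
data = [json.loads(d) for d in data]

# Write the objects back out as a single JSON array, which pandas can read.
with open('path/to/reformatted.json', 'w') as f:
    json.dump(data, f)

df = pd.read_json('path/to/reformatted.json')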
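
An alternative that avoids splitting on text at all is to let the standard library's JSON decoder find where each object ends. This is a minimal sketch, not what I originally ran: it uses json.JSONDecoder.raw_decode, and the function name parse_concatenated_json and the sample string are made up for illustration.

```python
import json


def parse_concatenated_json(text):
    """Return a list of the JSON values that appear back to back in text."""
    decoder = json.JSONDecoder()
    records = []
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so step over it by hand.
        if text[pos].isspace():
            pos += 1
            continue
        # raw_decode returns the parsed value and the index where it ended.
        obj, pos = decoder.raw_decode(text, pos)
        records.append(obj)
    return records


# Hypothetical sample: two pretty-printed objects separated by a blank line.
# The first string contains escaped newlines and a "}" to show they are harmless.
sample = r'''
{
    "content": "first line\n}\nsecond line",
    "source": {
        "name": "a"
    }
}

{
    "content": "x",
    "source": {
        "name": "b"
    }
}
'''

records = parse_concatenated_json(sample)
print(len(records))
```

The resulting list can then be passed to pd.DataFrame(records) in the same way as the parsed list in the other answers. Because the decoder tracks string boundaries itself, braces and escaped newlines inside string values do not affect where one object ends and the next begins.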

Upvotes: 0

Silveris

Reputation: 1186

Another solution, if you do not want to reformat your files: assuming your JSON is in a string called my_json, you could do:

import json
import pandas as pd

# Objects are separated by a blank line; skip empty chunks left by trailing newlines.
chunks = my_json.split('\n\n')
my_list = [json.loads(c) for c in chunks if c.strip()]
df = pd.DataFrame(my_list)

Upvotes: 0
