johnnydoe

Reputation: 402

Parsing JSON gives JSONDecodeError: Unterminated string

I have a document of newline-delimited JSON objects, to which I apply some functions. Everything works up until this line, which looks exactly like this:

{"_id": "5f114", "type": ["Type1", "Type2"], "company": ["5e84734"], "answers": [{"title": " answer 1", "value": false}, {"title": "answer 2
", "value": true}, {"title": "This is a title.", "value": true}, {"title": "This is another title", "value": true}], "audios": [null], "text": {}, "lastUpdate": "2020-07-17T06:24:50.562Z", "title": "This is a question?", "description": "1000 €.", "image": "image.jpg", "__v": 0}

The entire code:

import json  

def unimportant_function(d):
    d.pop('audios', None)
    return {k: v for k, v in d.items() if v != {}}


def parse_ndjson(data):
    return [json.loads(l) for l in data.splitlines()]

with open('C:\\path\\the_above_example.json', 'r', encoding="utf8") as handle:
    data = handle.read()
    dicts = parse_ndjson(data)

for d in dicts:
    new_d = unimportant_function(d)
    json_string=json.dumps(new_d, ensure_ascii=False)
    print(json_string)

The error JSONDecodeError: Unterminated string starting at: line 1 column 260 (char 259) happens at dicts = parse_ndjson(data). Why? I also have no idea what that symbol after "answer 2" is. It wasn't visible in the data, but it appeared when I copy-pasted it.

What is the problem with the data?

Upvotes: 1

Views: 7413

Answers (1)

Karl Knechtel

Reputation: 61478

The unprintable character embedded in the "answer 2" string is a paragraph separator (U+2029), which .splitlines() treats as a line boundary:

>>> 'foo\u2029bar'.splitlines()
['foo', 'bar']
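To make the difference concrete, here is a small sketch that compares .splitlines() with .split('\n'). The list of boundary characters is taken from the Python documentation for str.splitlines; the variable names are only illustrative.

```python
# Characters that str.splitlines() treats as line boundaries
# (in addition to the two-character sequence '\r\n').
boundaries = ['\n', '\r', '\x0b', '\x0c', '\x1c', '\x1d', '\x1e',
              '\x85', '\u2028', '\u2029']

# Every one of them splits 'foo<ch>bar' into two pieces.
split_by_splitlines = {ch: f'foo{ch}bar'.splitlines() for ch in boundaries}

# .split('\n') only reacts to '\n', so the paragraph separator survives.
kept_together = 'foo\u2029bar'.split('\n')

for ch, parts in split_by_splitlines.items():
    print(repr(ch), parts)
print(kept_together)
```

So a JSON string that legitimately contains any of these characters will be cut in half by .splitlines(), even though the file itself has only one '\n' per record.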

(Speculation: the ndjson file might be exploiting this to represent "this string should have a newline in it", working around the file format. If so, it should probably be using a \n escape instead.)
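As a side note on how such a character can end up raw in the file: the sketch below shows what Python's own json.dumps does with it. It assumes the data was written by something that behaves like json.dumps with ensure_ascii=False (which the question's output step also uses); how the original file was actually produced is not known.

```python
import json

text = 'answer 2\u2029'

# With ensure_ascii=False the separator is written out as a raw character,
# so the serialized record still contains something .splitlines() splits on.
raw = json.dumps(text, ensure_ascii=False)

# With the default ensure_ascii=True it becomes the six-character escape
# \u2029, which no line-splitting method reacts to.
escaped = json.dumps(text)

# A real newline is always written as the two-character escape \n.
newline = json.dumps('answer 2\n', ensure_ascii=False)

print(len(raw.splitlines()), len(escaped.splitlines()), newline)
```

In other words, an escaped form keeps each record on one physical line regardless of how the reader splits it.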

The character is not treated specially, however, when you iterate over the lines of a file object directly. In text mode with the default newline setting, only \n, \r and \r\n end a line:

>>> # For demonstration purposes, I create a `StringIO`
>>> # from a hard-coded string. A file object reading
>>> # from disk will behave similarly.
>>> import io
>>> for line in io.StringIO('foo\u2029bar'):
...     print(repr(line))
...
'foo\u2029bar'
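The same check can be run against a real file on disk. This is only a sketch using a temporary directory; the file name demo.txt is made up for the example.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'demo.txt')  # hypothetical file name

    # Two records separated by '\n'; the first contains U+2029.
    with open(path, 'w', encoding='utf8') as handle:
        handle.write('foo\u2029bar\nbaz\n')

    # Iterating the handle splits on newlines only.
    with open(path, 'r', encoding='utf8') as handle:
        lines = list(handle)

    # Reading everything and calling .splitlines() splits on U+2029 too.
    with open(path, 'r', encoding='utf8') as handle:
        pieces = handle.read().splitlines()

print(lines)
print(pieces)
```

The first list has two entries (one per record), the second has three.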

So, the simple fix is to make parse_ndjson accept an iterable of lines instead of calling .splitlines() itself, and to adjust the calling code to match. You can either pass the open handle directly:

def parse_ndjson(data):
    return [json.loads(l) for l in data]

with open('C:\\path\\the_above_example.json', 'r', encoding="utf8") as handle:
    dicts = parse_ndjson(handle)

or pass it to list to create a list explicitly:

def parse_ndjson(data):
    return [json.loads(l) for l in data]

with open('C:\\path\\the_above_example.json', 'r', encoding="utf8") as handle:
    dicts = parse_ndjson(list(handle))

or create the list using the provided .readlines() method:

def parse_ndjson(data):
    return [json.loads(l) for l in data]

with open('C:\\path\\the_above_example.json', 'r', encoding="utf8") as handle:
    dicts = parse_ndjson(handle.readlines())
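If the data is already in memory as one string, or the file may contain blank lines, a slightly more defensive variant might look like the sketch below. iter_ndjson is a hypothetical helper, not part of the json module, and the sample data is made up; the blank-line skipping and the line number in the error message are additions for illustration.

```python
import io
import json


def iter_ndjson(lines):
    """Yield one parsed object per non-blank line (hypothetical helper)."""
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines, e.g. after a trailing newline
        try:
            yield json.loads(line)
        except json.JSONDecodeError as exc:
            # Re-raise with the physical line number to ease debugging.
            raise ValueError(f"line {lineno}: {exc}") from exc


# Made-up sample: the first record contains a raw paragraph separator.
sample = '{"title": "answer 2\u2029"}\n\n{"title": "other"}\n'

# Works with any iterable of lines, such as a file handle or StringIO...
records = list(iter_ndjson(io.StringIO(sample)))

# ...and with a string split on '\n' only, rather than .splitlines().
from_string = list(iter_ndjson(sample.split('\n')))

print(records)
```

Both calls produce the same two dictionaries, with the separator preserved inside the first title.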

Upvotes: 3
