Reputation: 402
I have a document of newline-delimited JSON objects, to which I apply some functions. Everything works up until this line, which looks exactly like this:
{"_id": "5f114", "type": ["Type1", "Type2"], "company": ["5e84734"], "answers": [{"title": " answer 1", "value": false}, {"title": "answer 2
", "value": true}, {"title": "This is a title.", "value": true}, {"title": "This is another title", "value": true}], "audios": [null], "text": {}, "lastUpdate": "2020-07-17T06:24:50.562Z", "title": "This is a question?", "description": "1000 €.", "image": "image.jpg", "__v": 0}
The entire code:
import json

def unimportant_function(d):
    d.pop('audios', None)
    return {k: v for k, v in d.items() if v != {}}

def parse_ndjson(data):
    return [json.loads(l) for l in data.splitlines()]

with open('C:\\path\\the_above_example.json', 'r', encoding="utf8") as handle:
    data = handle.read()
    dicts = parse_ndjson(data)

for d in dicts:
    new_d = unimportant_function(d)
    json_string = json.dumps(new_d, ensure_ascii=False)
    print(json_string)
The error JSONDecodeError: Unterminated string starting at: line 1 column 260 (char 259) is raised at dicts = parse_ndjson(data). Why? I also have no idea what the symbol after "answer 2" is; it didn't appear in the data, but it appeared when I copy-pasted it.
What is the problem with the data?
Upvotes: 1
Views: 7413
Reputation: 61478
The unprintable character embedded in the "answer 2" string is a paragraph separator (U+2029), which .splitlines() treats as a line boundary:
>>> 'foo\u2029bar'.splitlines()
['foo', 'bar']
(Speculation: the ndjson file might be exploiting this to represent "this string should have a newline in it", working around the file format. If so, it should probably be using a \n escape instead.)
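To see how this produces the reported error, here is a minimal sketch using a made-up one-line record (the record variable is an assumption for illustration, not the asker's actual data). It compares .splitlines() with splitting on "\n" only:

```python
import json

# A single NDJSON record whose string value contains a raw U+2029
# (paragraph separator). This record is invented for the demonstration.
record = '{"title": "answer 2\u2029", "value": true}'

# str.splitlines() breaks at U+2029, leaving two invalid JSON fragments.
fragments = record.splitlines()
print(len(fragments))

try:
    json.loads(fragments[0])
except json.JSONDecodeError as exc:
    # The first fragment ends in the middle of a string literal.
    print("failed:", exc.msg)

# Splitting on "\n" only keeps the record whole, so it parses.
whole = record.split("\n")
parsed = json.loads(whole[0])
print(repr(parsed["title"]))
```

The first fragment stops in the middle of the "answer 2" string, which is why the decoder complains about an unterminated string.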
The character is, however, not treated specially if you iterate over the lines of the file normally:
>>> # For demonstration purposes, I create a `StringIO`
>>> # from a hard-coded string. A file object reading
>>> # from disk will behave similarly.
>>> import io
>>> for line in io.StringIO('foo\u2029bar'):
... print(repr(line))
...
'foo\u2029bar'
So, the simple fix is to make parse_ndjson expect a sequence of lines already: don't call .splitlines, and fix the calling code appropriately. You can either pass the open handle directly:
def parse_ndjson(data):
    return [json.loads(l) for l in data]

with open('C:\\path\\the_above_example.json', 'r', encoding="utf8") as handle:
    dicts = parse_ndjson(handle)
or pass it to list to create a list explicitly:
def parse_ndjson(data):
    return [json.loads(l) for l in data]

with open('C:\\path\\the_above_example.json', 'r', encoding="utf8") as handle:
    dicts = parse_ndjson(list(handle))
or create the list using the provided .readlines() method:
def parse_ndjson(data):
    return [json.loads(l) for l in data]

with open('C:\\path\\the_above_example.json', 'r', encoding="utf8") as handle:
    dicts = parse_ndjson(handle.readlines())
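For anyone who wants to check this end to end without the original file, here is a rough, self-contained sketch. It writes an invented two-record sample to a temporary file and parses it by iterating over the handle. The helper name parse_ndjson_lines, the sample data, and the blank-line skipping are my own additions, not part of the question's code:

```python
import json
import os
import tempfile

def parse_ndjson_lines(lines):
    # Hypothetical helper: parse each non-blank line as one JSON document.
    return [json.loads(line) for line in lines if line.strip()]

# Invented sample: the first record contains a raw U+2029 inside a string.
sample = '{"title": "answer 2\u2029"}\n{"title": "plain"}\n'

with tempfile.NamedTemporaryFile(
    'w', encoding='utf8', suffix='.json', delete=False
) as tmp:
    tmp.write(sample)
    path = tmp.name

try:
    # Iterating a text-mode file splits on newline characters only,
    # so the paragraph separator stays inside its record.
    with open(path, 'r', encoding='utf8') as handle:
        dicts = parse_ndjson_lines(handle)
finally:
    os.remove(path)

print(len(dicts))
```

Skipping blank lines is optional; it only guards against a trailing empty line at the end of the file.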
Upvotes: 3