Reputation: 13
I am trying to parse a large JSON file (16 GB) using ijson, but I always get the following error:
Exception has occurred: IncompleteJSONError
lexical error: invalid char in json text.
venue" : { "type" : NumberInt(0) }, "yea
(right here) ------^
File "C:\pyth\dblp_parser.py", line 14, in <module>
for record in ijson.items(f, 'item', use_float=True):
My code is as follows:
import ijson

paper_id_tab = []
with open("dblpv13.json", "rb") as f:
    for record in ijson.items(f, 'records.item', use_float=True):
        paper_id = record["_id"]  # _id is only for test
        paper_id_tab.append(paper_id)
Part of my JSON file is as follows:
{
    "_id" : "53e99784b7602d9701f3f636",
    "title" : "Flatlined",
    "authors" : [
        {
            "_id" : "53f58b15dabfaece00f8046d",
            "name" : "Peter J. Denning",
            "org" : "ACM Education Board",
            "gid" : "5b86c72de1cd8e14a3c2b772",
            "oid" : "544bd99545ce266baef0668a",
            "orgid" : "5f71b2811c455f439fe3c58a"
        }
    ],
    "venue" : {
        "_id" : "555036f57cea80f954169e28",
        "raw" : "Commun. ACM",
        "raw_zh" : null,
        "publisher" : null,
        "type" : NumberInt(0)
    },
    "year" : NumberInt(2002),
    "keywords" : [
        "linear scale",
        "false dichotomy"
    ],
    "n_citation" : NumberInt(7),
    "page_start" : "15",
    "page_end" : "19",
    "lang" : "en",
    "volume" : "45",
    "issue" : "6",
    "issn" : "",
    "isbn" : "",
    "doi" : "10.1145/508448.508463",
    "pdf" : "",
    "url" : [
        "http://doi.acm.org/10.1145/508448.508463"
    ],
    "abstract" : "Our propensity to create linear scales between opposing alternatives creates false dichotomies that hamper our thinking and limit our action."
},
I tried to fill in records item by item, but I always get the same error. I'm completely stuck.
Can anybody help me, please?
Upvotes: 0
Views: 652
Reputation: 16
I had the same problem with this dataset. ijson can't handle it because NumberInt(...) is MongoDB shell syntax, not valid JSON. I worked around the problem by creating a cleaned copy of the dataset and then parsing that copy with ijson. The approach is quite simple: read the original dataset line by line, replace each NumberInt(n) with the bare number n, and write the result to a new JSON file. The code is given below.
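import re

# Match only the NumberInt(<integer>) wrapper, so other ")" characters
# (for example in titles or abstracts) are left untouched.
number_int = re.compile(r'NumberInt\((-?\d+)\)')
with open('dblpv13.json', 'r', errors='ignore') as src, \
        open('dblpv13_clean.json', 'w') as dst:  # output must be opened for writing
    for line in src:
        dst.write(number_int.sub(r'\1', line))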
Now you can parse the new dataset with ijson as follows.
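import ijson

with open('dblpv13_clean.json', 'rb') as f:  # ijson prefers binary mode
    # the "item" prefix assumes the top-level JSON value is an array of records
    for i, element in enumerate(ijson.items(f, "item")):
        ...  # do something with element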
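Before rewriting a 16 GB file, it may be worth checking the cleaning step on a small in-memory sample. The sketch below uses only the standard library and assumes the only non-JSON construct in the file is NumberInt(<integer>); other MongoDB shell wrappers, if present, would need their own patterns. The helper name strip_number_int and the sample title with parentheses are made up for illustration.

```python
import json
import re

# Matches MongoDB-shell integer wrappers such as NumberInt(2002) or NumberInt(-1).
NUMBER_INT = re.compile(r'NumberInt\(\s*(-?\d+)\s*\)')


def strip_number_int(line: str) -> str:
    """Replace NumberInt(<int>) with the bare integer; leave all other text alone."""
    return NUMBER_INT.sub(r'\1', line)


# Hypothetical record: the title contains parentheses that must survive cleaning.
sample = '{"year" : NumberInt(2002), "title" : "Flatlined (reprint)"}'
cleaned = strip_number_int(sample)

# json.loads raises an error if the cleaned text is still not valid JSON.
record = json.loads(cleaned)
print(record)
```

Matching the whole wrapper instead of deleting every ")" keeps parentheses inside string values intact, which matters for titles and abstracts.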
Upvotes: 0