Reputation: 115
I have a wrongly-formatted JSON file where I have numbers with leading zeroes.
p = """[
{
"name": "Alice",
"RegisterNumber": 911100020001
},
{
"name": "Bob",
"RegisterNumber": 000111110300
}
]"""
arc = json.loads(p)
I get this error.
JSONDecodeError: Expecting ',' delimiter: line 8 column 24 (char 107)
Here's what is on char 107:
print(p[107])
#0
The problem is that this is the data I have. I am only showing two examples here, but my file has millions of lines to parse, so I need a script. In the end, I need this string:
"""[
{
"name": "Alice",
"RegisterNumber": "911100020001"
},
{
"name": "Bob",
"RegisterNumber": "000111110300"
}
]"""
How can I do it?
Upvotes: 1
Views: 954
Reputation: 37023
Since the problem is the leading zeroes, the easy way to fix the data is to split it into lines and fix any lines that exhibit the problem. It's cheap and nasty, but it seems to work.
data = """[
{
"name": "Alice",
"RegisterNumber": 911100020001
},
{
"name": "Bob",
"RegisterNumber": 000111110300
}
]"""
import json

result = []
for line in data.splitlines():
    if ': 0' in line:
        # Quote the value as-is so the leading zeroes are preserved.
        key, value = line.split(': ', 1)
        comma = ',' if value.endswith(',') else ''
        result.append(key + ': "' + value.rstrip(',') + '"' + comma)
    else:
        result.append(line)
data = "\n".join(result)
arc = json.loads(data)
print(arc)
Upvotes: 1
Reputation: 13600
This won't be pretty, but you could fix it with a regex.
import json
import re
p = "..."
# Wrap the digits that follow "RegisterNumber": in double quotes.
sub = re.sub(r'"RegisterNumber":\s*([0-9]+)', r'"RegisterNumber": "\1"', p)
json.loads(sub)
This will match all the cases where RegisterNumber is followed by digits.
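If other keys can carry zero-padded numbers too, the same idea can be generalized. The following is only a sketch: the helper name quote_bare_integers and the pattern are assumptions, not part of the original answer. It quotes any bare integer that follows a colon and is followed by a comma or a closing bracket, so it would also rewrite text that merely looks like that inside a string value.

```python
import json
import re

# Hypothetical helper: quotes every bare integer value, whatever its key is called.
# The lookahead requires a comma or closing bracket next, so floats such as 1.5 are left alone.
BARE_INT = re.compile(r'(:\s*)(\d+)(?=\s*[,}\]])')

def quote_bare_integers(text):
    return BARE_INT.sub(r'\1"\2"', text)

p = """[
{
"name": "Alice",
"RegisterNumber": 911100020001
},
{
"name": "Bob",
"RegisterNumber": 000111110300
}
]"""

fixed = quote_bare_integers(p)
arc = json.loads(fixed)
print(arc[1]["RegisterNumber"])  # 000111110300
```

Unlike the key-specific pattern, this turns every integer value into a string, which matches the output requested in the question.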
Upvotes: 2
Reputation: 1914
Read the file (ideally line by line) and replace all the values with their string representation. You can use regular expressions for that (the re module).
Then save the result and parse the now-valid JSON later.
If the file fits into memory, you don't need to save it, of course; just pass the now-valid JSON string to loads.
Here is a simple version:
import json
p = """[
{
"name": "Alice",
"RegisterNumber": 911100020001
},
{
"name": "Bob",
"RegisterNumber": 000111110300
}
]"""
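from re import sub
# Wrap every standalone run of exactly 12 digits in quotes
# (assumes all register numbers are 12 digits long).
p = sub(r"\b(\d{12})\b", r'"\1"', p)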
arc = json.loads(p)
print(arc[1])
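For a file too large to comfortably rewrite in one string, the line-by-line variant described above might look like the sketch below. The function name fix_file, the file names, and the pattern are illustrative assumptions; the pattern expects each value to sit at the end of its own line, as in the sample data.

```python
import json
import re
import tempfile
from pathlib import Path

# Matches a bare integer value at the end of a line, with an optional trailing comma.
BARE_INT_LINE = re.compile(r'(:\s*)(\d+)(\s*,?\s*)$')

def fix_file(src, dst):
    """Copy src to dst line by line, quoting bare integer values on the way."""
    with open(src, encoding="utf-8") as fin, open(dst, "w", encoding="utf-8") as fout:
        for line in fin:
            fout.write(BARE_INT_LINE.sub(r'\1"\2"\3', line))

# Demo with temporary files; the file names are placeholders.
workdir = Path(tempfile.mkdtemp())
broken = workdir / "broken.json"
fixed = workdir / "fixed.json"
broken.write_text(
    '[\n{\n"name": "Bob",\n"RegisterNumber": 000111110300\n}\n]\n',
    encoding="utf-8",
)

fix_file(broken, fixed)
arc = json.loads(fixed.read_text(encoding="utf-8"))
print(arc[0]["RegisterNumber"])  # 000111110300
```

Only one line is held in memory while the file is being repaired; the final json.loads call still loads the whole repaired document, so that step needs enough memory for the parsed data.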
Upvotes: 5