Reputation: 71
I have a json file with size in 500 MB with structure like
{"user": "[email protected]","contact":[{"name":"Jack "John","number":"+1 23456789"},{"name":"Jack Jill","number":"+1 232324789"}]}
Issue is when i parse this string with pandas read_json, I get error
Unexpected character found when decoding object value
Issue is happening because of double quotes "
in before John
in first elecment of contact array.
Need help in escaping this quote either with sed/awk/python so that i can directly load the file in pandas.
I am fine with ignoring contact with name as Jack as well but i can not ignore the complete row.
Upvotes: 0
Views: 865
Reputation: 203493
In general here is no robust way to fix this, nor to ignore any part of the row nor even to ignore the whole row because if a quoted string can contain quotes and can also contain :
s and ,
s then a messed up string can look exactly like a valid set of fields.
Having said that, if we target only the "name" field and assume that the associated string cannot contain ,
s then with GNU awk for the 3rd arg to match()
we can do:
$ cat tst.awk
{
while ( match($0,/(.*"name":")(([^,"]*"[^,"]*)+)(".*)/,a) ) {
gsub(/"/,"",a[2])
$0 = a[1] a[2] a[4]
}
print
}
$ awk -f tst.awk file
{"user": "[email protected]","contact":[{"name":"Jack O'Malley","number":"+1 23456789"},{"name":"Jack Jill Smith-Jones","number":"+1 232324789"}]}
{"user": "[email protected]","contact":[{"name":"X Æ A-12 Musk","number":"+1 23456789"},{"name":"Jack Jill","number":"+1 232324789"}]}
I ran the above on this input file:
$ cat file
{"user": "[email protected]","contact":[{"name":"Jack "O'Malley","number":"+1 23456789"},{"name":"Jack "Jill "Smith-Jones","number":"+1 232324789"}]}
{"user": "[email protected]","contact":[{"name":"X Æ A-12 "Musk","number":"+1 23456789"},{"name":"Jack Jill","number":"+1 232324789"}]}
and you can do the same using any awk:
$ cat tst.awk
{
while ( match($0,/"name":"([^,"]*"[^,"]*)+"/) ) {
tgt = substr($0,RSTART+8,RLENGTH-9)
gsub(/"/,"",tgt)
$0 = substr($0,1,RSTART+7) tgt substr($0,RSTART+RLENGTH-1)
}
print
}
$ awk -f tst.awk file
{"user": "[email protected]","contact":[{"name":"Jack O'Malley","number":"+1 23456789"},{"name":"Jack Jill Smith-Jones","number":"+1 232324789"}]}
{"user": "[email protected]","contact":[{"name":"X Æ A-12 Musk","number":"+1 23456789"},{"name":"Jack Jill","number":"+1 232324789"}]}
Upvotes: 1
Reputation: 445
You can try removing that extra " using regex ("[\s\w]*)(")([\s\w]*")
.
The regex try to match any string with spaces/alphabets followed by a quote and then followed again by spaces/alphabets and removing that additional quote.
This will work for the given problem but for more complex patterns, you may have to tweak the regex.
Complete code:
import re
text_json = '{"user": "[email protected]","contact":[{"name":"Jack "John","number":"+1 23456789"},{"name":"Jack Jill","number":"+1 232324789"}]}'
text_json = re.sub(r'("[\s\w]*)(")([\s\w]*")', r"\1\3", text_json )
print(text_json)
Output:
'{"user": "[email protected]","contact":[{"name":"Jack John","number":"+1 23456789"},{"name":"Jack Jill","number":"+1 232324789"}]}'
Upvotes: 0