Datadev
Datadev

Reputation: 71

escape json string having double quotes

I have a json file with size in 500 MB with structure like

{"user": "[email protected]","contact":[{"name":"Jack "John","number":"+1 23456789"},{"name":"Jack Jill","number":"+1 232324789"}]}

Issue is when i parse this string with pandas read_json, I get error

Unexpected character found when decoding object value

Issue is happening because of double quotes " in before John in first elecment of contact array.

Need help in escaping this quote either with sed/awk/python so that i can directly load the file in pandas.

I am fine with ignoring contact with name as Jack as well but i can not ignore the complete row.

Upvotes: 0

Views: 865

Answers (2)

Ed Morton
Ed Morton

Reputation: 203493

In general here is no robust way to fix this, nor to ignore any part of the row nor even to ignore the whole row because if a quoted string can contain quotes and can also contain :s and ,s then a messed up string can look exactly like a valid set of fields.

Having said that, if we target only the "name" field and assume that the associated string cannot contain ,s then with GNU awk for the 3rd arg to match() we can do:

$ cat tst.awk
{
    while ( match($0,/(.*"name":")(([^,"]*"[^,"]*)+)(".*)/,a) ) {
        gsub(/"/,"",a[2])
        $0 = a[1] a[2] a[4]
    }
    print
}

$ awk -f tst.awk file
{"user": "[email protected]","contact":[{"name":"Jack O'Malley","number":"+1 23456789"},{"name":"Jack Jill Smith-Jones","number":"+1 232324789"}]}
{"user": "[email protected]","contact":[{"name":"X Æ A-12 Musk","number":"+1 23456789"},{"name":"Jack Jill","number":"+1 232324789"}]}

I ran the above on this input file:

$ cat file
{"user": "[email protected]","contact":[{"name":"Jack "O'Malley","number":"+1 23456789"},{"name":"Jack "Jill "Smith-Jones","number":"+1 232324789"}]}
{"user": "[email protected]","contact":[{"name":"X Æ A-12 "Musk","number":"+1 23456789"},{"name":"Jack Jill","number":"+1 232324789"}]}

and you can do the same using any awk:

$ cat tst.awk
{
    while ( match($0,/"name":"([^,"]*"[^,"]*)+"/) ) {
        tgt = substr($0,RSTART+8,RLENGTH-9)
        gsub(/"/,"",tgt)
        $0 = substr($0,1,RSTART+7) tgt substr($0,RSTART+RLENGTH-1)
    }
    print
}

$ awk -f tst.awk file
{"user": "[email protected]","contact":[{"name":"Jack O'Malley","number":"+1 23456789"},{"name":"Jack Jill Smith-Jones","number":"+1 232324789"}]}
{"user": "[email protected]","contact":[{"name":"X Æ A-12 Musk","number":"+1 23456789"},{"name":"Jack Jill","number":"+1 232324789"}]}

Upvotes: 1

DumbCoder
DumbCoder

Reputation: 445

You can try removing that extra " using regex ("[\s\w]*)(")([\s\w]*").

The regex try to match any string with spaces/alphabets followed by a quote and then followed again by spaces/alphabets and removing that additional quote.

This will work for the given problem but for more complex patterns, you may have to tweak the regex.

Complete code:

import re

text_json = '{"user": "[email protected]","contact":[{"name":"Jack "John","number":"+1 23456789"},{"name":"Jack Jill","number":"+1 232324789"}]}'
text_json = re.sub(r'("[\s\w]*)(")([\s\w]*")', r"\1\3", text_json )

print(text_json)

Output:

'{"user": "[email protected]","contact":[{"name":"Jack John","number":"+1 23456789"},{"name":"Jack Jill","number":"+1 232324789"}]}'

Upvotes: 0

Related Questions