Humaid Kidwai

Reputation: 99

Regexp to remove extra quotation marks so I can load the string as JSON (Python)

I'm getting some data as a string in the response to a request made with the requests library, and I want to convert it to JSON with json.loads(). The string is quite messy, so I have to clean it before it can be loaded as a JSON object.

The string can have extra quotation marks like:

{"address":""home address 25"street",
"date":"""}

I'm trying to write a regexp that removes these extra quotation marks so the result is:

{"address":"home address 25 street",
"date":""}

My idea was to first write a regexp that matches all valid quotation marks, then match every quotation mark except those, and replace the rest with an empty string ('').

[image]

Here's the regexp I tried, but it fails to detect all valid quotation marks. As shown in the image, the quotation marks above the red dots are valid and should have been detected.

Note that the last red dot has two quotation marks above it; that's the kind of issue I want to solve.
Also, ignore the blacked-out part; that's sensitive info.

Upvotes: 1

Views: 1362

Answers (2)

user13843220

You can probably just match every string, whatever its content,
as long as it is surrounded by proper JSON structure,
then fix the double quotes inside it from a replacement callback function.

The regex to match a pseudo-valid JSON string is this:

r'([:\[,{]\s*)"(.*?)"(?=\s*[:,\]}])'

see https://regex101.com/r/vqn6e0/1
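
To see what that pattern actually captures, here is a minimal sketch run against the sample from the question. The names PATTERN, sample and bodies are only illustrative; group 2 is the raw string body, stray quotes included.

```python
import re

# Group 1: the structural character before a string ({ [ , :) plus whitespace.
# Group 2: everything up to a closing quote that is followed by : , ] or }.
PATTERN = re.compile(r'([:\[,{]\s*)"(.*?)"(?=\s*[:,\]}])')

sample = '{"address":""home address 25"street",\n"date":"""}'

# Collect the raw body of every pseudo-valid string in the sample.
bodies = [m.group(2) for m in PATTERN.finditer(sample)]
for body in bodies:
    print(repr(body))
```

The stray quotes end up inside group 2, which is why the cleanup can be done per string in the callback.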

Within the callback, use two regexes to fix the quotes.

  • The first matches a quote that has a non-quote character on both sides,
    r'(?<=[^"])"(?=[^"])', and replaces it with a space.
  • The second replaces all remaining quotes with the empty string.

Python sample:

>>> import re
>>>
>>> text = '''
... {"address":""home address 25"street",
... "date":"""}
... '''
>>>
>>> def repl_call(m):
...     preq = m.group(1)
...     qbody = m.group(2)
...     qbody = re.sub( r'(?<=[^"])"(?=[^"])', ' ', qbody )
...     qbody = re.sub( r'"', '', qbody )
...     return preq + '"' + qbody + '"'
...
>>> print( re.sub( r'([:\[,{]\s*)"(.*?)"(?=\s*[:,\]}])', repl_call, text ))

{"address":"home address 25 street",
"date":""}

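Since the goal is json.loads(), the same steps can be wrapped in a helper and checked end to end. This is a sketch only: clean_quotes and STRING_RE are made-up names, and it assumes a stray quote never sits directly before a comma, colon, ] or } inside a value, because the pattern would treat that quote as the end of the string.

```python
import json
import re

# Same pattern as above: structural prefix, string body, lookahead for : , ] }
STRING_RE = re.compile(r'([:\[,{]\s*)"(.*?)"(?=\s*[:,\]}])')


def clean_quotes(raw):
    """Strip stray double quotes from inside each string of a JSON-like text."""
    def fix(m):
        # A quote between two non-quote characters becomes a space ...
        body = re.sub(r'(?<=[^"])"(?=[^"])', ' ', m.group(2))
        # ... and any quote still left is dropped.
        body = body.replace('"', '')
        return m.group(1) + '"' + body + '"'
    return STRING_RE.sub(fix, raw)


raw = '{"address":""home address 25"street",\n"date":"""}'
data = json.loads(clean_quotes(raw))
print(data)
```
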
Upvotes: 1

Pramote Kuacharoen

Reputation: 1541

import re

str1 = '''
{"address":""home address 25"street",
"date":"""}

'''
# Replace every " and \n with a space
str2 = re.sub(r'["\n]', ' ', str1)

# Find all key, value pairs
data = re.findall(r'([^{,:]+):([^,:}]+)', str2)

# Reconstruct a dictionary
result = {key.strip(): value.strip() for key, value in data}

print(result)
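
One thing to watch with this approach: it splits on commas and colons, so it only holds up for flat objects whose values contain neither. A small sketch of that limit, with the steps above wrapped in a function (rebuild_dict is a made-up name, and the second input is an invented example):

```python
import re


def rebuild_dict(raw):
    """Rebuild a flat dict from a messy JSON-like string of key:value pairs."""
    # Replace every double quote and newline with a space.
    flat = re.sub(r'["\n]', ' ', raw)
    # A key runs up to the next colon, a value up to the next , : or }.
    pairs = re.findall(r'([^{,:]+):([^,:}]+)', flat)
    return {key.strip(): value.strip() for key, value in pairs}


ok = rebuild_dict('{"address":""home address 25"street",\n"date":"""}')
print(ok)

# A comma inside a value cuts the value short.
cut = rebuild_dict('{"address":"12 Main St, Springfield"}')
print(cut)
```

The first call gives the expected dictionary; the second keeps only the part of the address before the comma.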

Upvotes: 1
