ashap551

Reputation: 63

Using REGEX to Handle Nested Double Quotes in JSON Strings in Python

I'm using a generative AI API that returns text responses as JSON strings, which I intend to feed into an application in real time. The problem is that the JSON returned by the API often contains small errors, most commonly with double quotes. These syntax errors make my Python code fail when it parses the string as JSON.

For instance, I have the following JSON string:
'{"test":"this is "test" of "a" test"","result":"your result is "out" in our website"}'

As you can see, the value for "test" contains several unescaped double quotes, so parsing it as JSON raises an error. What I want to do is use a regex to convert the inner double quotes to single quotes, so that the result looks like this:
'{"test":"this is 'test' of 'a' test'", "result": "your result is 'out' in our website"}'

The best I can do is as follows:

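import re

text = '{"test":"this is "test" of "a" test"","result":"your result is "out" in our website"}'

def repl_call(m):
    # group(1): the structural character (: [ , or {) and whitespace before the opening quote
    preq = m.group(1)
    # group(2): the string body, which may contain stray double quotes
    qbody = m.group(2).replace('"', "'")
    return preq + '"' + qbody + '"'

# A quote counts as closing only when it is followed by : , ] or }
print(re.sub(r'([:\[,{]\s*)"(.*?)"(?=\s*[:,\]}])', repl_call, text))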

The code above returns the intended result. However, if I add a comma inside the value, such as
{"test":"this is "test" of "a", test"","result":"your result is "out" in our website"}

...the code breaks and returns the following:
'{"test":"this is 'test' of 'a", test"","result":"your result is 'out' in our website"}'

:(
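The pattern breaks because its lookahead accepts any quote followed by a comma as a closing quote. One possible workaround, shown here only as a rough sketch, is to scan the string character by character and also check what comes after the comma. The function names `repair_inner_quotes` and `_looks_like_closing_quote` are hypothetical, and the rule used (a comma must be followed by something that can start a JSON key or value) is a heuristic assumption that can still guess wrong, for example when the text itself contains `", "`.

```python
import json


def _skip_ws(s: str, pos: int) -> int:
    """Return the index of the first non-whitespace character at or after pos."""
    while pos < len(s) and s[pos] in " \t\r\n":
        pos += 1
    return pos


def _looks_like_closing_quote(s: str, pos: int) -> bool:
    """Heuristic: decide whether the quote just before pos ends a JSON string."""
    pos = _skip_ws(s, pos)
    if pos >= len(s) or s[pos] in ":}]":
        return True
    if s[pos] == ",":
        # After a structural comma we expect a key or value to start.
        pos = _skip_ws(s, pos + 1)
        return pos < len(s) and (s[pos] in '"{[-' or s[pos].isdigit())
    return False


def repair_inner_quotes(raw: str) -> str:
    """Replace double quotes that appear inside string values with single quotes."""
    out = []
    in_string = False
    i = 0
    while i < len(raw):
        ch = raw[i]
        if not in_string:
            if ch == '"':
                in_string = True
            out.append(ch)
        elif ch == "\\" and i + 1 < len(raw):
            # Keep existing escape sequences such as \" untouched.
            out.append(raw[i:i + 2])
            i += 1
        elif ch == '"':
            if _looks_like_closing_quote(raw, i + 1):
                in_string = False
                out.append('"')
            else:
                out.append("'")
        else:
            out.append(ch)
        i += 1
    return "".join(out)


broken = '{"test":"this is "test" of "a", test"","result":"your result is "out" in our website"}'
fixed = repair_inner_quotes(broken)
print(json.loads(fixed))
```

On the comma example this should yield a string that `json.loads` accepts, with the inner quotes turned into single quotes. It is a repair of last resort: getting valid JSON from the model in the first place is more reliable than guessing which quotes were meant to be structural.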

I have tried improving my prompt (prompt engineering) to avoid the double quotes and return only valid JSON. This works to some degree, but I still hit enough syntax errors that I have to retry the same prompt multiple times, which incurs unnecessary delays and costs.

My question is: is there a common function or regex pattern I can apply in Python to repair syntax errors in my JSON string, specifically misplaced double quotes?

I'm open to a variety of suggestions, including Python packages that handle JSON string cleansing, and any advice on GenAI tools that enforce JSON output. I currently use Gemini, which I like a lot, but it doesn't seem to offer JSON enforcement as explicitly as OpenAI's API does.

Upvotes: 1

Views: 103

Answers (1)

Linda Lawton - DaImTo

Reputation: 117176

If you are requesting JSON back, you should set response_mime_type; then you should not have these issues parsing the JSON.

from dotenv import load_dotenv
import google.generativeai as genai
import os

load_dotenv()
genai.configure(api_key=os.environ['API_KEY'])
MODEL_NAME_LATEST = os.environ['MODEL_NAME_LATEST']

model = genai.GenerativeModel(
    model_name=MODEL_NAME_LATEST,
    # Set the `response_mime_type` to output JSON
    generation_config={"response_mime_type": "application/json"})

prompt = """
  List 5 popular cookie recipes.
  Using this JSON schema:
    Recipe = {"recipe_name": str}
  Return a `list[Recipe]`
  """

response = model.generate_content(prompt)
print(response.text)

Just make sure the JSON schema you give it in the prompt is itself valid JSON, with every comma where it should be; otherwise the model may build its output incorrectly.
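With JSON mode on, `response.text` can be handed to `json.loads`. The sketch below shows one way to do that and to report where parsing failed if the output is still malformed. It uses a hard-coded `sample_text` as a stand-in for `response.text` so it runs without an API key; the helper name `parse_json_response` and the sample recipe names are made up for illustration.

```python
import json


def parse_json_response(response_text: str):
    """Parse model output; if it is malformed, say where parsing stopped."""
    try:
        return json.loads(response_text)
    except json.JSONDecodeError as err:
        # Show a little context around the failure position for debugging.
        start = max(err.pos - 20, 0)
        snippet = response_text[start:err.pos + 20]
        raise ValueError(
            f"Invalid JSON at char {err.pos} ({err.msg}); near: {snippet!r}"
        ) from err


# Stand-in for response.text (assumed shape: a list of Recipe objects).
sample_text = '[{"recipe_name": "Chocolate Chip"}, {"recipe_name": "Snickerdoodle"}]'
recipes = parse_json_response(sample_text)
print([recipe["recipe_name"] for recipe in recipes])
```

Catching the error in one place makes it easy to log the bad output or retry the request once, rather than letting a raw `JSONDecodeError` surface deep inside the application.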

Response schema

Another option is to use a response schema.

from dotenv import load_dotenv
import google.generativeai as genai
import os
import typing_extensions as typing

load_dotenv()
genai.configure(api_key=os.environ['API_KEY'])
MODEL_NAME_LATEST = os.environ['MODEL_NAME_LATEST']


class Recipe(typing.TypedDict):
    recipe_name: str


model = genai.GenerativeModel(
    model_name=MODEL_NAME_LATEST,
    # Set the `response_mime_type` to output JSON
    # Pass the schema object to the `response_schema` field
    generation_config={"response_mime_type": "application/json",
                       "response_schema": list[Recipe]})

prompt = "List 5 popular cookie recipes"

response = model.generate_content(prompt)
print(response.text)
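Even with a response schema, it can be worth checking the parsed result before passing it on. The following is a minimal sketch, assuming the response is a JSON array of `Recipe` objects as in the code above. It uses the standard library's `typing.TypedDict` rather than `typing_extensions`, a hard-coded string in place of `response.text`, and a hypothetical helper `check_recipes` that only handles simple field types such as `str`.

```python
import json
from typing import TypedDict, get_type_hints


class Recipe(TypedDict):
    recipe_name: str


def check_recipes(response_text: str) -> list:
    """Parse the response and confirm every item has the fields Recipe declares."""
    data = json.loads(response_text)
    if not isinstance(data, list):
        raise ValueError("expected a JSON array of recipes")
    hints = get_type_hints(Recipe)  # {'recipe_name': str}
    for index, item in enumerate(data):
        if not isinstance(item, dict):
            raise ValueError(f"item {index} is not a JSON object")
        for key, expected_type in hints.items():
            if not isinstance(item.get(key), expected_type):
                raise ValueError(f"item {index} has a missing or mistyped {key!r}")
    return data


# Stand-in for response.text.
sample_text = '[{"recipe_name": "Gingerbread"}, {"recipe_name": "Shortbread"}]'
print(check_recipes(sample_text))
```

For anything beyond flat string fields, a validation library would probably be a better fit than this hand-rolled check.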

See the JSON mode documentation.

Upvotes: 3
