Reputation: 11697

Identify all instances of problematic quotation marks

I have a (properly formed) large string variable that I turn into lists of dictionaries. I split the massive string on newline characters, iterate over the pieces, and run list(eval(i)) on each one. This works in the majority of cases, but for every exception thrown, I add the 'malformed' string to a failed_attempt array. I have been inspecting the failed cases for an hour now, and I believe they fail whenever there is an extra quotation mark that is not part of the keys of a dictionary. For example,

eval('''[{"question":"What does "AR" stand for?","category":"DFB","answers":["Assault Rifle","Army Rifle","Automatic Rifle","Armalite Rifle"],"sources":["https://www.npr.org/2018/02/28/588861820/a-brief-history-of-the-ar-15"]}]''')

This will fail because there are quotation marks around "AR". If you replace them with single quotation marks, e.g.

eval('''[{"question":"What does 'AR' stand for?","category":"DFB","answers":["Assault Rifle","Army Rifle","Automatic Rifle","Armalite Rifle"],"sources":["https://www.npr.org/2018/02/28/588861820/a-brief-history-of-the-ar-15"]}]''')

It now succeeds.

Similarly:

eval('''[{"question":"Test Question, Test Question?","category":"DFB","answers":["2004","1930","1981","This has never occurred"],"sources":[""SOWELL: Exploding myths""]}]''')

This fails due to the extra quotes around the "SOWELL" source, but again succeeds if you replace them with single quotes.

So I need a way to identify quotes that appear anywhere other than around the keys of the dictionary (question, category, sources) and replace them with single quotes. I'm not sure of the right way to do this.

@Wiktor's submission nearly does the trick, but will fail on the following:

example = '''[{"question":"Which of the following is NOT considered to be "interstate commerce" by the Supreme Court, and this cannot be regulated by Congress?","category":"DFB","answers":["ANSWER 1","ANSWER 2","ANSWER 3","All of these are considered "Interstate Commerce""],"sources":["SOURCE 1","SOURCE 2","SOURCE 3"]}]'''
re.sub(r'("\w+":[[{]*")(.*?)("(?:,|]*}))', lambda x: "{}{}{}".format(x.group(1),x.group(2).replace('"', "'"),x.group(3)), example)


Out[170]: '[{"question":"Which of the following is NOT considered to be \'interstate commerce\' by the Supreme Court, and this cannot be regulated by Congress?","category":"DFB","answers":["ANSWER 1","ANSWER 2","ANSWER 3","All of these are considered "Interstate Commerce""],"sources":["SOURCE 1","SOURCE 2","SOURCE 3"]}]'

Notice that the second set of double quotation marks on "Interstate Commerce" in the answers is not replaced.

Upvotes: 1

Answers (3)

Charif DZ

Reputation: 14751

Try this. I know it will work for all question and category key values, and I hope I didn't forget any case for the list values:

import re


def escape_quotes(match):
    """Escape normal quotes captured by the second group."""
    # match any quote except these structural ones: `["` or `","` or `"]`
    # (raw string, so the backslashes reach the regex engine unchanged)
    RE_ESCAPE_QUOTES_IN_LIST = re.compile(r'(?<!\[)(?<!",)"(?!,"|\])')

    def escape_quote_in_string(string):
        return '"{}"'.format(string[1:-1].replace('"', "'"))

    key, value = match.groups()
    # this will fix for sure the problem related to these keys
    if any(e in key for e in ('question', 'category')):
        value = escape_quote_in_string(value)
    if any(e in key for e in ('answers', 'sources')):
        # keep only [" or "," or "] and escape anything else
        value = RE_ESCAPE_QUOTES_IN_LIST.sub(r"'", value)

    return f'{key}{value}'


# test cases
exps = ['''[{"question":"What does "AR" stand for?"}]''',
        '''[{"sources":[""SOWE"LL: Ex"ploding myths""]}]''',
        '''[{"question":"Test ", Test" Que"sti"on?","sources":[""SOWELL: Ex""ploding myths""]}]''']

# extract each key/value pair of the expression; this is easy because the keys are fixed
key = '(?:"(?:question|category|answers|sources)":)'
RE_KEY_VALUE = re.compile(rf'({key})(.+?)\s*(?=,\s*{key}|}})', re.S)

# test all cases
for exp in exps:
    # escape normal quotes
    exp = RE_KEY_VALUE.sub(escape_quotes, exp)
    print(eval(exp))

# [{'question': "What does 'AR' stand for?"}]
# [{'sources': ["'SOWE'LL: Ex'ploding myths'"]}]
# [{'question': "Test ', Test' Que'sti'on?", 'sources': ["'SOWELL: Ex''ploding myths'"]}]
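
As a possible variation on the final step (an assumption on my part, not something the question requires): once the stray quotes are replaced, the lines shown here look like valid JSON, so json.loads could stand in for eval. It does not execute arbitrary code, and its JSONDecodeError carries the offset where parsing stopped, which may help locate the first problematic quote. The helper name try_parse and the shortened sample strings are hypothetical.

```python
import json

# Shortened samples modeled on the question's data (hypothetical).
repaired = '''[{"question":"What does 'AR' stand for?","category":"DFB"}]'''
broken = '''[{"question":"What does "AR" stand for?","category":"DFB"}]'''


def try_parse(line):
    """Return (parsed, None) on success or (None, error) on failure."""
    try:
        return json.loads(line), None
    except json.JSONDecodeError as err:
        return None, err


parsed, err = try_parse(repaired)
print(parsed)

parsed_bad, err_bad = try_parse(broken)
# err_bad.pos is the character offset at which the parser gave up
print(err_bad.msg, err_bad.pos)
```

This only works if every line really is JSON (double-quoted keys, no Python-only literals such as None or True); otherwise ast.literal_eval may be the closer substitute for eval.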

Upvotes: 1

DisappointedByUnaccountableMod

Reputation: 6826

Rather than converting the values extracted from this monster string back into a string representation of a list and then using eval(), take the values you already have in variables and append them to the list.

Or construct a dict from the values rather than creating a string representation of a dictionary and then evaluating it.

It doesn't help that you haven't put any code in your question, so these answers are sketchy. If you put a minimal reproducible example (https://stackoverflow.com/help/minimal-reproducible-example) in your question, with some very minimal data (a good example that doesn't cause an exception in eval() and a bad example that recreates the problem), then I should be able to better suggest how to apply my answer.

Your code must be doing something a bit like this:

import traceback

sourcesentences = [
     'this is no problem'
     ,"he said 'That is no problem'" 
     ,'''he said "It's a great day"''' 
]

# this is doomed if there is a single or double quote in the sentence
for sentence in sourcesentences:
    words = sentence.split()
    myliststring="[\""+"\",\"".join(words)+"\"]"    
    print( f"The sentence is >{sentence}<" )
    print( f"my string representation of the sentence is >{myliststring}<" )
    try:
        mylistfromstring = eval(myliststring)
        print( f"my list is >{mylistfromstring}<" )
    except SyntaxError as e:
        print( f"eval failed with SyntaxError on >{myliststring}<")
        traceback.print_exc()
    print()

And this produces a SyntaxError on the third test sentence.

Now let's try escaping characters in the variable before wrapping them in quotation marks:

# this adapts to a quote within the string
def safequote(s):
    if '"' in s:
        s = s.replace( '"','\\"' )
    return s

for sentence in sourcesentences:
    print( f"The sentence is >{sentence}<" )
    words = [safequote(s) for s in sentence.split()]
    myliststring="[\""+"\",\"".join(words)+"\"]"    
    print( f"my string representation of the sentence is >{myliststring}<" )
    try:
        mylistfromstring = eval(myliststring)
        print( f"my list is >{mylistfromstring}<" )
    except SyntaxError as e:
        print( f"eval failed with SyntaxError on >{myliststring}<")
        traceback.print_exc()
    print()

This works, but is there a better way?

Isn't it a lot simpler to avoid eval altogether? That means never constructing a string representation of the list, which in turn avoids any problems with quotation marks in the text:

for sentence in sourcesentences:
    print( f"The sentence is >{sentence}<" )
    words = sentence.split()
    print( f"my list is >{words}<" )
    print()
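
If a string form of the list is genuinely needed (to write it to a file, say), one option is to let the json module do the quoting instead of joining strings by hand. This is a sketch under that assumption; the variable names are illustrative and not taken from the question.

```python
import json

sentence = '''he said "It's a great day"'''
words = sentence.split()

# json.dumps escapes embedded double quotes, so the text round-trips intact
serialized = json.dumps(words)
restored = json.loads(serialized)

print(serialized)
print(restored == words)
```

The round trip needs no manual escaping and no eval.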

Upvotes: 1

zipa

Reputation: 27879

If your text is stored in a variable, say text, you can use re.sub():

re.sub(r'(\s")|("\s)', ' ', text)
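
A quick check of what this substitution does, using two fragments modeled on the question's examples (the variable names are hypothetical). It only touches a double quote that sits directly next to whitespace, so it appears to handle the "AR" case but leaves the doubled quotes in the sources example alone.

```python
import re

pattern = r'(\s")|("\s)'

# A quote next to a space is replaced, together with that space, by one space
with_spaces = 'What does "AR" stand for?'
cleaned = re.sub(pattern, ' ', with_spaces)
print(cleaned)

# No quote here is adjacent to whitespace, so nothing changes
no_spaces = '[""SOWELL: Exploding myths""]'
unchanged = re.sub(pattern, ' ', no_spaces)
print(unchanged == no_spaces)
```

Note that the quotes are dropped rather than converted to single quotes, and structural quotes would also be removed wherever the data has whitespace around them.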

Upvotes: 0
