user1995

Reputation: 538

How to use stringed regex as proper regex with raw literalization

I have a list of regexes in string form (created by parsing natural-language search queries). I want to use them to search text. Here is how I am doing it right now:

# given that regex_list=["r'((?<=[\W_])(%s\(\+\))(?=[\W_]|$))'", "r'((?<=[\W_])(activation\ of\ %s)(?=[\W_]|$))'"....]
sent='in this file we have the case of a foo(+) in the town'
gs1='foo'
for string_regex in regex_list:
    mo=re.search(string_regex %gs1,sent,re.I)
    if mo:
        print(mo.group())

What I need is to be able to use these string regexes while still getting the benefit of Python's raw literal notation, as is recommended for regex patterns. As for where the expressions come from: I have natural-text search commands like this:

LINE_CONTAINS foo(+)

Which I use pyparsing to convert to regex like r'((?<=[\W_])(%s\(\+\))(?=[\W_]|$))' based on a grammar. I send a list of these human rules to the pyparsing code and it gives me back a list of ~100 of these regexes. These regexes are constructed in string format.

This is the MCVE version of the code that generates these strings that are supposed to act as regexes -

from pyparsing import *
import re


def parse_hrr(received_sentences):
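    # BEFORE, AFTER and JOIN are referenced in `keyword` below, so they are defined here too (assumed to be plain Literals like the rest)
    UPTO, AND, OR, WORDS, CHARACTERS, BEFORE, AFTER, JOIN = map(Literal, "UPTO AND OR WORDS CHARACTERS BEFORE AFTER JOIN".split())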
    LBRACE,RBRACE = map(Suppress, "{}")
    integer = pyparsing_common.integer()

    LINE_CONTAINS, PARA_STARTSWITH, LINE_ENDSWITH = map(Literal,
        """LINE_CONTAINS PARA_STARTSWITH LINE_ENDSWITH""".split()) # put option for LINE_ENDSWITH. Users may use, I don't presently
    keyword = UPTO | WORDS | AND | OR | BEFORE | AFTER | JOIN | LINE_CONTAINS | PARA_STARTSWITH


    class Node(object):
        def __init__(self, tokens):
            self.tokens = tokens

        def generate(self):
            pass

    class LiteralNode(Node):
        def generate(self):
            return "(%s)" %(re.escape(''.join(self.tokens[0]))) # here, merged the elements, so that re.escape does not have to do an escape for the entire list
        def __repr__(self):
            return repr(self.tokens[0])

    class ConsecutivePhrases(Node):
        def generate(self):
            join_these=[]
            tokens = self.tokens[0]
            for t in tokens:
                tg = t.generate()
                join_these.append(tg)
            seq = []
            for word in join_these[:-1]:
                if (r"(([\w]+\s*)" in word) or (r"((\w){0," in word): #or if the first part of the regex in word:
                    seq.append(word + "")
                else:
                    seq.append(word + r"\s+")
            seq.append(join_these[-1])
            result = "".join(seq)
            return result

    class AndNode(Node):
        def generate(self):
            tokens = self.tokens[0]
            join_these=[]
            for t in tokens[::2]:
                tg = t.generate()
                tg_mod = tg[0]+r'?=.*\b'+tg[1:][:-1]+r'\b)' # to place the regex commands at the right place
                join_these.append(tg_mod)
            joined = ''.join(ele for ele in join_these)
            full = '('+ joined+')'
            return full

    class OrNode(Node):
        def generate(self):
            tokens = self.tokens[0]
            joined = '|'.join(t.generate() for t in tokens[::2])
            full = '('+ joined+')'
            return full

    class LineTermNode(Node):
        def generate(self):
            tokens = self.tokens[0]
            ret = ''
            dir_phr_map = {
                'LINE_CONTAINS': lambda a:  r"((?:(?<=[\W_])" + a + r"(?=[\W_]|$))456", #%gs1, sent, re.I)",
                'PARA_STARTSWITH':
                    lambda a: ("r'(^" + a + "(?=[\W_]|$))' 457") if 'gene' in repr(a) #%gs1, s, re.I)"
                    else ("r'(^" + a + "(?=[\W_]|$))' 458")} #,s, re.I
            for line_dir, phr_term in zip(tokens[0::2], tokens[1::2]):
                ret = dir_phr_map[line_dir](phr_term.generate())
            return ret

## THE GRAMMAR
    word = ~keyword + Word(alphas, alphanums+'-_+/()')
    some_words = OneOrMore(word).setParseAction(' '.join, LiteralNode)
    phrase_item = some_words

    phrase_expr = infixNotation(phrase_item,
                                [
                                (None, 2, opAssoc.LEFT, ConsecutivePhrases),
                                (AND, 2, opAssoc.LEFT, AndNode),
                                (OR, 2, opAssoc.LEFT, OrNode),
                                ],
                                lpar=Suppress('{'), rpar=Suppress('}')
                                ) # structure of a single phrase with its operators

    line_term = Group((LINE_CONTAINS|PARA_STARTSWITH)("line_directive") +
                      (phrase_expr)("phrases")) # basically giving structure to a single sub-rule having line-term and phrase

    line_contents_expr = line_term.setParseAction(LineTermNode)
###########################################################################################
    mrrlist=[]
    for t in received_sentences:
        t = t.strip()
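        try:
            parsed = line_contents_expr.parseString(t)
        except ParseException:
            # the original snippet had no except clause; skipping unparseable rules is an assumption
            continue
        temp_regex = parsed[0].generate()
        mrrlist.append(temp_regex)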
    return(mrrlist)

So basically, the code strings the regex together piece by piece. Then I add the necessary parameters (re.search, %gs1, etc.) to get the complete regex search query. I want to be able to use these string regexes for searching, so I had earlier assumed eval() would convert each string to its corresponding Python expression, which is why I used it. I was wrong.

TL;DR - I basically have a list of strings that have been created in the source code, and I want to be able to use them as regexes, using Python's raw literal notation.

Upvotes: 2

Views: 147

Answers (1)

Blckknght

Reputation: 104762

Your issue seems to stem from a misunderstanding of what raw string literals do and what they're for. There's no magic raw string type. A raw string literal is just another way of creating a normal string. A raw literal just gets parsed a little bit differently.

For instance, the raw string r"\(foo\)" can also be written "\\(foo\\)". The doubled backslashes tell Python's regular string parsing algorithm that you want an actual backslash character in the string, rather than the backslash in the literal being part of an escape sequence that gets replaced by a special character. The raw string algorithm doesn't need the extra backslashes since it never replaces escape sequences.
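A minimal check of that equivalence (the variable names are just for illustration):

```python
raw = r"\(foo\)"
doubled = "\\(foo\\)"

# Both literals produce the same ordinary str; "raw" only affects parsing.
assert raw == doubled
assert type(raw) is str

print(len(raw))  # 7 characters: \ ( f o o \ )
print(raw)       # \(foo\)
```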

However, in this particular case the special treatment is not actually necessary, since \( and \) are not meaningful escape sequences in a Python string. When Python sees an invalid escape sequence, it just includes it literally (backslash and all), although recent versions emit a warning about it (a DeprecationWarning since Python 3.6, a SyntaxWarning since 3.12). So you could also use "\(foo\)" (without the r prefix) and it would still work.

It's not generally a good idea to rely upon backslashes being ignored, though, since if you edit the string later you might inadvertently add an escape sequence that Python does understand (when you really wanted the raw, un-transformed version). Since regex syntax has a number of its own escape sequences that are also escape sequences in Python (but with different meanings, such as \b and \1), it's a best practice to always write regex patterns with raw strings to avoid introducing issues when editing them.
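Here is a small sketch of the \b collision, using a made-up sample text. In a normal literal Python consumes \b as a backspace before the regex engine ever sees it; in a raw literal the backslash survives and the regex engine reads it as a word boundary:

```python
import re

text = "a foo b"

plain = "\bfoo\b"   # Python turns each \b into a backspace character (\x08)
raw = r"\bfoo\b"    # backslash and 'b' are both kept for the regex engine

print(len(plain), len(raw))          # 5 7
print(re.search(plain, text))        # None
print(re.search(raw, text).group())  # foo
```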

Now to bring this around to the example code you've shown. I have no idea why you're using eval at all. As far as I can tell, you've mistakenly wrapped extra quotes around your regex patterns for no good reason, and you're using eval to undo that wrapping. But because only the inner strings use raw string syntax, by the time you eval them it is too late to stop Python's string parsing from messing up your literals if they contain any of the troublesome escape sequences (the outer string will already have parsed \b, for instance, and turned it into the ASCII backspace character \x08).
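To illustrate the timing problem, here is a sketch with a hypothetical wrapped pattern (not one of yours). The second half assumes you are stuck with wrapped strings that were built at runtime with their backslashes intact; in that case ast.literal_eval can strip the wrapper without running arbitrary code, though not adding the wrapper in the first place is simpler:

```python
import ast
import re

# The outer literal is NOT raw, so Python converts \b before eval ever runs.
wrapped = "r'\bfoo\b'"
damaged = eval(wrapped)               # eval only removes the inner r'...' wrapper
print(repr(damaged))                  # '\x08foo\x08'
print(re.search(damaged, "a foo b"))  # None

# A wrapped string whose backslashes are real characters can be unwrapped safely.
intact = "r'" + "\\bfoo\\b" + "'"
pattern = ast.literal_eval(intact)
print(re.search(pattern, "a foo b").group())  # foo
```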

You should tear the eval code out and fix your literals to avoid the extra quotes. This should work:

import re

regex_list=[r'((?<=[\W_])(%s\(\+\))(?=[\W_]|$))',   # use raw literals, with no extra quotes!
            r'((?<=[\W_])(activation\ of\ %s)(?=[\W_]|$))'] # unnecessary backslashes?

sent='in this file we have the case of a foo(+) in the town'
gs1='foo'
for string_regex in regex_list:
    mo=re.search(string_regex %gs1,sent,re.I)    # no eval here!
    if mo:
        print(mo.group())

This example works for me (it prints foo(+)). Note that you've got some unnecessary backslashes in your second pattern (before the spaces). Those are harmless, but might add even more confusion to an already complicated subject (regexes are notoriously hard to understand).
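The same idea applies to the code that generates the patterns. Strings assembled at runtime already contain real backslashes, so the generator can return the bare pattern text with no r'...' wrapper. The following is only a rough sketch: line_contains and literal are hypothetical stand-ins for your LINE_CONTAINS lambda and LiteralNode.generate(), and the search term is passed in directly instead of going through %s:

```python
import re

def literal(term):
    # Mirrors LiteralNode.generate(): escape the term and wrap it in a group.
    return "(%s)" % re.escape(term)

def line_contains(phrase_regex):
    # Raw literals are used only for the fixed fragments written in source;
    # the result is an ordinary str that can go straight to the re module.
    return r"((?<=[\W_])" + phrase_regex + r"(?=[\W_]|$))"

sent = "in this file we have the case of a foo(+) in the town"
pattern = line_contains(literal("foo(+)"))
compiled = re.compile(pattern, re.I)

mo = compiled.search(sent)
if mo:
    print(mo.group())  # foo(+)
```

Compiling each generated pattern once with re.compile also avoids re-parsing ~100 patterns on every search.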

Upvotes: 1
