Just a learner

Reputation: 28602

Python user input as regular expression, how to do it correctly?

I'm using Python 3. In my application, the user can enter a regular expression string directly, and the application uses it to match some strings. For example, the user can type \t+. However, I can't get it to work because I can't convert the input into a correct regular expression. Below is the code I've tried.

>>> import re
>>> re.compile(re.escape("\t+")).findall("  ")
[]

However, when I change the regex string to \t, it works.

>>> re.compile(re.escape("\t")).findall("   ")
['\t']

Note that the parameter to findall IS a tab character. I don't know why it is not displayed correctly on Stack Overflow.

Can anyone point me in the right direction to solve this? Thanks.

Upvotes: 5

Views: 13569

Answers (3)

Bruno Lubascher

Reputation: 2121

Compile user input

I assume that the user input is a string, wherever in your system it comes from:

user_input = input("Input regex:")  # check console, it is expecting your input
print("User typed: '{}'. Input type: {}.".format(user_input, type(user_input)))

This means you need to turn it into a regex, which is what re.compile is for. If the string you pass to re.compile is not a valid regex, it raises re.error.

Therefore, you can create a function to check whether the input is valid. You used re.escape, so I added a flag to the function that controls whether re.escape is applied.

def is_valid_regex(regex_from_user: str, escape: bool) -> bool:
    # Optionally escape the input so every character is treated literally.
    pattern = re.escape(regex_from_user) if escape else regex_from_user
    try:
        re.compile(pattern)
    except re.error:
        return False
    return True

print("If you don't use re.escape, the input is valid: {}.".format(is_valid_regex(user_input, escape=False)))
print("If you do use re.escape, the input is valid: {}.".format(is_valid_regex(user_input, escape=True)))

If your user input is: \t+, you will get:

>> If you don't use re.escape, the input is valid: True.
>> If you do use re.escape, the input is valid: True.

However, if your user input is: [\t+, you will get:

>> If you don't use re.escape, the input is valid: False.
>> If you do use re.escape, the input is valid: True.

Notice that [\t+ was indeed an invalid regex, yet with re.escape it becomes valid. That is because re.escape escapes all special characters so that they are treated as literal characters. So if the input is \t+ and you use re.escape, you will be looking for the literal sequence of characters \, t, + and not for one or more tab characters.
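
To make the difference concrete, here is a minimal sketch comparing the two ways of compiling the same input. The raw string r"\t+" stands in for what input() would return if the user typed \t+, and the sample text is made up for illustration.

```python
import re

# r"\t+" stands in for the three characters a user would type: \, t, +
typed = r"\t+"

# Made-up sample text: two real tabs, then the literal characters \t+
text = "a\t\tb and a literal \\t+ here"

# Compiled as-is, the input is a regex meaning "one or more tabs".
as_regex = re.compile(typed).findall(text)

# Escaped first, the input only matches the literal characters \, t, +.
as_literal = re.compile(re.escape(typed)).findall(text)

print(as_regex)    # ['\t\t']
print(as_literal)  # ['\\t+']
```

So for user-supplied regular expressions, compiling the input directly is what gives the intended "one or more tabs" behavior.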

Checking your lookup string

Take the string you want to look into. For example, here is a string where the character between quotes is supposed to be a tab:

string_to_look_in = 'This is a string with a "  " tab character.'

You can manually check for tabs by using the repr function.

print(string_to_look_in)
print(repr(string_to_look_in))
>> This is a string with a "    " tab character.
>> 'This is a string with a "\t" tab character.'

Notice that by using repr the \t representation of the tab character gets displayed.

Test script

Here is a script for you to try all these things:

import re

string_to_look_in = 'This is a string with a "  " tab character.'
print("String to look into:", string_to_look_in)
print("String to look into:", repr(string_to_look_in), "\n")

user_input = input("Input regex:")  # check console, it is expecting your input

print("\nUser typed: '{}'. Input type: {}.".format(user_input, type(user_input)))


def is_valid_regex(regex_from_user: str, escape: bool) -> bool:
    # Optionally escape the input so every character is treated literally.
    pattern = re.escape(regex_from_user) if escape else regex_from_user
    try:
        re.compile(pattern)
    except re.error:
        return False
    return True

print("\nIf you don't use re.escape, the input is valid: {}.".format(is_valid_regex(user_input, escape=False)))
print("If you do use re.escape, the input is valid: {}.".format(is_valid_regex(user_input, escape=True)))

if is_valid_regex(user_input, escape=False):
    regex = re.compile(user_input)
    print("\nRegex compiled as '{}' with type {}.".format(repr(regex), type(regex)))

    matches = regex.findall(string_to_look_in)
    print('Matches found:', matches)

else:
    print('\nThe regex was not valid, so no matches.')
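
The script above compiles the input once to validate it and again to use it. A possible variation, sketched here with a hypothetical helper name (compile_user_regex is not part of the re module), compiles once and returns either the pattern or None:

```python
import re
from typing import Optional


def compile_user_regex(raw: str) -> Optional[re.Pattern]:
    """Hypothetical helper: return the compiled pattern, or None if invalid."""
    try:
        return re.compile(raw)
    except re.error:
        return None


# r"\t+" stands in for user input typed as \t+
pattern = compile_user_regex(r"\t+")
matches = pattern.findall("a\tb\t\tc") if pattern else []
print(matches)  # ['\t', '\t\t']
```

Returning None keeps the validity check and the compiled object in one place, at the cost of the caller having to test for None.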

Upvotes: 9

Kurtis Rader

Reputation: 7469

A literal \t+ from an external source is not the same thing as the literal string "\t+". What does print("\t+") output? What about print(r"\t+")? The latter is equivalent to accepting that literal string as input to be used as a regex. The former is not. However, for this specific situation the distinction does not matter, since a literal tab character should behave exactly the same as \t in a regex. Ponder the following examples from an IPython session (where '^I' is a literal tab character shown in caret notation):

In [24]: re.compile('\t+').findall('^I')
Out[24]: ['\t']

In [25]: re.compile('\t+').findall("\t")
Out[25]: ['\t']

In [26]: re.compile(r'\t+').findall('^I')
Out[26]: ['\t']

In [27]: re.compile(r'\t+').findall("\t")
Out[27]: ['\t']

In [28]: re.compile(r'\t+').findall(r"\t")
Out[28]: []

I can only conclude your first example, the one which didn't produce the expected output, did not have a literal tab in the quoted string.
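
The difference between the two spellings, and the fact that both behave the same as regexes, can be checked with a short sketch. The variable names are only for illustration.

```python
import re

cooked = "\t+"   # Python source literal: a real tab followed by '+'
raw = r"\t+"     # backslash, 't', '+': what input() returns for \t+

# The strings differ in length...
lengths = (len(cooked), len(raw))
print(lengths)  # (2, 3)

# ...but as regexes both mean "one or more tabs".
same_result = re.findall(cooked, "\t\t") == re.findall(raw, "\t\t")
print(same_result)  # True
```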

Also, re.escape() is not appropriate for this situation. Its purpose is to ensure that a string from an untrusted source is treated literally rather than as a regex, so that it can safely be used as a literal string to be matched.
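
For contrast, here is a minimal sketch of the kind of situation re.escape() is meant for: searching for exact text that happens to contain regex metacharacters. The needle and haystack strings are made up for illustration.

```python
import re

# Made-up literal text the user wants to find exactly as typed.
needle = "1+1=2 (approx.)"
haystack = "They wrote 1+1=2 (approx.) twice: 1+1=2 (approx.)"

# Escaped, the +, parentheses and dot match themselves.
literal_hits = len(re.compile(re.escape(needle)).findall(haystack))
print(literal_hits)  # 2

# Unescaped, the same text is read as a regex and no longer matches.
unescaped_match = re.search(needle, haystack)
print(unescaped_match)  # None
```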

Upvotes: 2

DYZ

Reputation: 57085

The result of re.escape("\t+") is '\\\t\\+'. Note that the + sign is escaped with a backslash and is not a special character anymore. It does not mean "one or more tabs."
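
This can be verified with a short sketch. Here "\t" in the Python source is a real tab character, as in the question's code.

```python
import re

escaped = re.escape("\t+")  # "\t" here is a real tab character
print(repr(escaped))        # '\\\t\\+'

pattern = re.compile(escaped)

# Tabs alone do not match, because the pattern requires a literal '+'.
tabs_only = pattern.findall("\t\t")
print(tabs_only)  # []

# A tab followed by a literal '+' does match.
tab_plus = pattern.findall("\t+")
print(tab_plus)   # ['\t+']
```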

Upvotes: 1
