Reputation: 28602
I'm using Python 3. In my application, the user can input a regular expression string directly, and the application will use it to match some strings. For example, the user can type \t+. However, I can't make it work because I can't convert the input to a correct regular expression. Here is the code I've tried:
>>> import re
>>> re.compile(re.escape("\t+")).findall(" ")
[]
However, when I change the regex string to \t, it works.
>>> re.compile(re.escape("\t")).findall(" ")
['\t']
Note that the parameter to findall IS a tab character. I don't know why it isn't displayed correctly on Stack Overflow.
Can anyone point me in the right direction to solve this? Thanks.
Upvotes: 5
Views: 13569
Reputation: 2121
I assume that the user input is a string, wherever in your system it comes from:
user_input = input("Input regex:") # check console, it is expecting your input
print("User typed: '{}'. Input type: {}.".format(user_input, type(user_input)))
This means that you need to transform it into a regex, and that is what re.compile is for. If you call re.compile with a str that is not a valid regex, it raises re.error.
Therefore, you can create a function to check whether the input is valid. You used re.escape, so I added a flag to the function that controls whether re.escape is applied.
def is_valid_regex(regex_from_user: str, escape: bool) -> bool:
    try:
        if escape: re.compile(re.escape(regex_from_user))
        else: re.compile(regex_from_user)
        is_valid = True
    except re.error:
        is_valid = False
    return is_valid
print("If you don't use re.escape, the input is valid: {}.".format(is_valid_regex(user_input, escape=False)))
print("If you do use re.escape, the input is valid: {}.".format(is_valid_regex(user_input, escape=True)))
If your user input is \t+, you will get:
>> If you don't use re.escape, the input is valid: True.
>> If you do use re.escape, the input is valid: True.
However, if your user input is [\t+, you will get:
>> If you don't use re.escape, the input is valid: False.
>> If you do use re.escape, the input is valid: True.
Notice that [\t+ is indeed an invalid regex; however, once it is passed through re.escape it becomes valid. That is because re.escape escapes all special characters so that they are treated as literal characters. So if you have \t+ and you use re.escape, you will be looking for the sequence of characters \, t, + and not for a tab character.
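As a minimal sketch of that difference, assuming the user typed the three characters \, t, + at the prompt (the names user_input and text are only for illustration):

```python
import re

# Assumption: this raw string stands in for what input() returns when the
# user types backslash, t, plus.
user_input = r"\t+"
text = 'two tabs "\t\t" and the literal text \\t+'

# Compiled directly, \t+ means "one or more tab characters".
as_regex = re.compile(user_input).findall(text)

# Escaped first, it only matches the literal characters \, t, +.
as_literal = re.compile(re.escape(user_input)).findall(text)

print(as_regex)    # ['\t\t']
print(as_literal)  # ['\\t+']
```

So for a user-supplied regex, compiling the input directly is what gives \t+ its usual meaning.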
Take the string you want to search. For example, here is a string where the character between the double quotes is a tab:
string_to_look_in = 'This is a string with a "\t" tab character.'
You can manually check for tabs by using the repr function.
print(string_to_look_in)
print(repr(string_to_look_in))
>> This is a string with a " " tab character.
>> 'This is a string with a "\t" tab character.'
Notice that with repr, the tab character is displayed as its \t representation.
Here is a script for you to try all these things:
import re
string_to_look_in = 'This is a string with a "\t" tab character.'
print("String to look into:", string_to_look_in)
print("String to look into:", repr(string_to_look_in), "\n")
user_input = input("Input regex:") # check console, it is expecting your input
print("\nUser typed: '{}'. Input type: {}.".format(user_input, type(user_input)))
def is_valid_regex(regex_from_user: str, escape: bool) -> bool:
    try:
        if escape: re.compile(re.escape(regex_from_user))
        else: re.compile(regex_from_user)
        is_valid = True
    except re.error:
        is_valid = False
    return is_valid
print("\nIf you don't use re.escape, the input is valid: {}.".format(is_valid_regex(user_input, escape=False)))
print("If you do use re.escape, the input is valid: {}.".format(is_valid_regex(user_input, escape=True)))
if is_valid_regex(user_input, escape=False):
    regex = re.compile(user_input)
    print("\nRegex compiled as '{}' with type {}.".format(repr(regex), type(regex)))
    matches = regex.findall(string_to_look_in)
    print('Matches found:', matches)
else:
    print('\nThe regex was not valid, so no matches.')
Upvotes: 9
Reputation: 7469
A literal \t+ from an external source is not the same thing as the literal string "\t+". What does print("\t+") output? What about print(r"\t+")? The latter is equivalent to accepting that literal string as input to be used as a regex. The former is not. However, for this specific situation the distinction does not matter, since a literal tab character should behave exactly the same as \t in a regex. Ponder the following examples from an IPython session (^I is how the terminal displays a literal tab character):
In [24]: re.compile('\t+').findall('^I')
Out[24]: ['\t']
In [25]: re.compile('\t+').findall("\t")
Out[25]: ['\t']
In [26]: re.compile(r'\t+').findall('^I')
Out[26]: ['\t']
In [27]: re.compile(r'\t+').findall("\t")
Out[27]: ['\t']
In [28]: re.compile(r'\t+').findall(r"\t")
Out[28]: []
I can only conclude your first example, the one which didn't produce the expected output, did not have a literal tab in the quoted string.
Also, re.escape() is not appropriate for this situation. Its purpose is to ensure that a string from an untrusted source is treated literally rather than as a regex, so that it can safely be used as a literal string to be matched.
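For contrast, here is a small sketch of the kind of case re.escape() is meant for. The search term and text are made up for illustration; the assumption is that the user wants the exact text found, not interpreted as a pattern:

```python
import re

# Hypothetical search term supplied by a user; meant to be matched literally.
search_term = "1+1"
text = "1+1 is not the same as 11 or 111"

# Unescaped, + is a quantifier: "one or more 1, followed by 1".
unescaped = re.findall(search_term, text)

# Escaped, the pattern matches only the exact text "1+1".
escaped = re.findall(re.escape(search_term), text)

print(unescaped)  # ['11', '111']
print(escaped)    # ['1+1']
```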
Upvotes: 2
Reputation: 57085
The result of re.escape("\t+") is '\\\t\\+'. Note that the + sign is escaped with a backslash and is no longer a special character. It does not mean "one or more tabs."
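A quick way to check this yourself is the sketch below. The test strings are made up for illustration, and the tab is written as "\t" so that it survives copy and paste:

```python
import re

# "\t+" in normal (non-raw) Python source is two characters: a tab and a plus.
escaped = re.escape("\t+")
pattern = re.compile(escaped)

# A lone tab does not match, because the pattern also requires a literal +.
lone_tab = pattern.findall("\t")

# A tab followed by a literal + does match.
tab_plus = pattern.findall("a\t+b")

print(lone_tab)  # []
print(tab_plus)  # ['\t+']
```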
Upvotes: 1