How to create a dynamic regex in Python?

Question

I have a loop that picks country names one by one from a list. For the current iteration, say x_3 = 'United Kingdom'. I want to search for x_3 in a text txt_to_srch, keeping in mind that 'United Kingdom' may appear as 'United   Kingdom' (more than one space) or ' United Kingdom ' in the text. The phrase 'United Kingdom' is present in txt_to_srch.

I have used the following code:

x_3 = '\s+'.join(x_3.split(" "))
x_3 = r"\b" + re.escape(x_3)+r"\b"
x2 = re.compile(x_3,re.IGNORECASE)
txt_to_srch = re.sub(r'\n',' ',txt_to_srch)
txt_to_srch = re.sub(r'\r',' ',txt_to_srch)
txt_to_srch = re.sub(r'\t',' ',txt_to_srch)
y = re.findall(x2,txt_to_srch)

However, I am getting y as an empty list.

Thierry Lathuille · Accepted Answer

Don't use re.escape here: it adds unwanted backslashes. From the documentation:

re.escape(pattern)

Escape special characters in pattern. This is useful if you want to match an arbitrary literal string that may have regular expression metacharacters in it.

Using re.escape on your first regex turns it into United\\s\+Kingdom, which tries to match a literal \ followed by an s and a literal + between United and Kingdom.
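To see the effect directly, here is a minimal sketch that builds the pattern with and without re.escape and tries both on a sample string. The variable names pattern, escaped and text are only illustrative and are not from the question.

```python
import re

# Build the whitespace-tolerant pattern the same way the question does.
pattern = r"\s+".join("United Kingdom".split(" "))

# Escaping the finished pattern also escapes the backslash and the plus sign.
escaped = re.escape(pattern)

print(pattern)   # United\s+Kingdom
print(escaped)   # United\\s\+Kingdom

text = "United   Kingdom"
print(re.findall(pattern, text))  # ['United   Kingdom']
print(re.findall(escaped, text))  # []
```

The escaped version only matches the literal characters United\s+Kingdom, which is why findall returns an empty list on ordinary text.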

Without it, your code works as expected:

import re

x_3 = "United Kingdom"

txt_to_srch = """Monty Pythons come from United Kingdom. They do.
United Kingdom is their home. Yes.
United Kingdom"""

x_3 = r'\s+'.join(x_3.split(" "))
x_3 = r"\b" + x_3 + r"\b"
# print(x_3)
# \bUnited\s+Kingdom\b
x2 = re.compile(x_3,re.IGNORECASE)
txt_to_srch = re.sub(r'\n',' ',txt_to_srch)
txt_to_srch = re.sub(r'\r',' ',txt_to_srch)
txt_to_srch = re.sub(r'\t',' ',txt_to_srch)
y = re.findall(x2,txt_to_srch)

print(y)
# ['United Kingdom', 'United Kingdom', 'United Kingdom']
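If some names in the list could contain regex metacharacters (for example a dot or parentheses), one possible variation is to escape each word on its own and only then join the words with \s+. This is a sketch under that assumption, not part of the code above; the helper name build_country_pattern is hypothetical.

```python
import re

def build_country_pattern(name):
    """Compile a case-insensitive pattern for `name` that tolerates
    any run of whitespace between its words."""
    # Escape each word separately so the \s+ separator stays a real regex token.
    words = name.split()
    body = r"\s+".join(re.escape(word) for word in words)
    # \b assumes the name starts and ends with a word character.
    return re.compile(r"\b" + body + r"\b", re.IGNORECASE)

pat = build_country_pattern("United Kingdom")
text = "From United   Kingdom.\nUNITED\tkingdom too."
print(pat.pattern)
print(pat.findall(text))
```

Because \s+ already matches newlines, carriage returns and tabs, this version does not need the re.sub calls that replace them with spaces.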
