Reputation: 625
I have a loop that picks country names one by one from a list. For the current iteration, say x_3 = 'United Kingdom'. I want to search for x_3 in a text txt_to_srch, keeping in mind that 'United Kingdom' may appear in the text as 'United  Kingdom' (more than one space) or as '\nUnited Kingdom\n'. The phrase 'United Kingdom' is present in txt_to_srch.
I have used the following code:
x_3 = '\s+'.join(x_3.split(" "))
x_3 = r"\b" + re.escape(x_3)+r"\b"
x2 = re.compile(x_3,re.IGNORECASE)
txt_to_srch = re.sub(r'\n',' ',txt_to_srch)
txt_to_srch = re.sub(r'\r',' ',txt_to_srch)
txt_to_srch = re.sub(r'\t',' ',txt_to_srch)
y = re.findall(x2,txt_to_srch)
However, y comes back as an empty list.
Upvotes: 0
Views: 293
Reputation: 24282
Don't use re.escape here: it adds unwanted backslashes. From the documentation:
re.escape(pattern)
Escape special characters in pattern. This is useful if you want to match an arbitrary literal string that may have regular expression metacharacters in it.
Using re.escape on your first regex turns it into United\\s\+Kingdom, which tries to match a literal \ followed by an s and a literal + between United and Kingdom.
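A quick way to see this for yourself is to print the escaped pattern and try it against both the real text and the literal characters. This is only a small sketch; the variable names are made up for illustration.

```python
import re

# The pattern as built by the first line of the question's code
joined = r"\s+".join("United Kingdom".split(" "))

# re.escape treats the backslash and the plus sign as literals to protect
escaped = re.escape(joined)
print(escaped)

# The escaped pattern no longer matches ordinary text...
no_match = re.search(escaped, "United Kingdom")

# ...but it does match the literal characters \s+ between the two words
literal_match = re.search(escaped, r"United\s+Kingdom")
```

Here no_match is None, while literal_match is a match object, which is why findall returned an empty list.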
Without it, your code works as expected:
import re
x_3 = "United Kingdom"
txt_to_srch = """Monty Pythons come from United Kingdom. They do.
United Kingdom is their home. Yes.
United Kingdom"""
x_3 = r'\s+'.join(x_3.split(" "))
x_3 = r"\b" + x_3 + r"\b"
# print(x_3)
# \bUnited\s+Kingdom\b
x2 = re.compile(x_3,re.IGNORECASE)
txt_to_srch = re.sub(r'\n',' ',txt_to_srch)
txt_to_srch = re.sub(r'\r',' ',txt_to_srch)
txt_to_srch = re.sub(r'\t',' ',txt_to_srch)
y = re.findall(x2,txt_to_srch)
print(y)
# ['United Kingdom', 'United Kingdom', 'United Kingdom']
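If some names in your list might contain regex metacharacters, one possible variation is to escape each word on its own and only then join the words with \s+, so the separator itself is never escaped. The sketch below assumes that approach; build_phrase_pattern and the sample text are hypothetical names, not part of your code.

```python
import re

def build_phrase_pattern(phrase):
    """Compile a case-insensitive pattern for phrase, allowing any run of whitespace between words."""
    words = phrase.split()
    # Escape the words individually; the \s+ separator is added afterwards
    body = r"\s+".join(re.escape(word) for word in words)
    return re.compile(r"\b" + body + r"\b", re.IGNORECASE)

text = "Visit the United   Kingdom.\nUnited\nKingdom is here. united kingdom"
pattern = build_phrase_pattern("United Kingdom")
matches = pattern.findall(text)
print(matches)
```

Because \s already matches newlines, carriage returns and tabs, this version does not need the three re.sub calls that replace them with spaces.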
Upvotes: 1