Reputation: 265
I am trying to locate items in sentences with a regular expression, where one item is a substring of the other, but it always matches the shorter item. For example, there are two items ["The Duke", "The Duke of A"] and some sentences:
The Duke
The Duke is a movie.
How is the movie The Duke?
The Duke of A
The Duke of A is a movie.
How is the movie The Duke of A?
What I want after finding the locations is:
The_Duke
The_Duke is a movie.
How is the movie The_Duke?
The_Duke_of_A
The_Duke_of_A is a movie.
How is the movie The_Duke_of_A?
The code I have tried is:
for sent in sentences:
    for item in ["The Duke", "The Duke of A"]:
        find = re.search(r'{0}'.format(item), sent)
        if find:
            sent = sent.replace(sent[find.start():find.end()], item.replace(" ", "_"))
But I got:
The_Duke
The_Duke is a movie.
How is the movie The_Duke?
The_Duke of A
The_Duke of A is a movie.
How is the movie The_Duke of A?
Changing the position of the items in the list is not suitable in my case, as I have a large list (over 10,000 items).
Upvotes: 0
Views: 108
Reputation: 27495
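You can use re.sub; its repl argument can be a function, so you can simply replace the spaces in each match. List the longer item first in the alternation so that it is tried before its substring.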
import re
with open("filename.txt") as sentences:
    for line in sentences:
        print(re.sub(r"The Duke of A|The Duke",
                     lambda s: s[0].replace(' ', '_'),
                     line))
This results in:
The_Duke
The_Duke is a movie.
How is the movie The_Duke?
The_Duke_of_A
The_Duke_of_A is a movie.
How is the movie The_Duke_of_A?
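With a list of over 10,000 items, the alternation does not have to be written by hand. Here is a minimal sketch that builds it from the list, assuming the items are plain strings; `build_pattern` and `join_items` are names made up for this example. Sorting a copy by length puts longer items ahead of their substrings without touching the original list, and `re.escape` guards against items that contain regex metacharacters.

```python
import re

def build_pattern(items):
    # Longest items first, so "The Duke of A" is tried before "The Duke".
    ordered = sorted(items, key=len, reverse=True)
    # Escape each item so it is matched literally, then join into one alternation.
    return re.compile("|".join(re.escape(item) for item in ordered))

def join_items(text, pattern):
    # Replace spaces with underscores only inside each matched item.
    return pattern.sub(lambda m: m.group(0).replace(" ", "_"), text)

items = ["The Duke", "The Duke of A"]  # original order is left untouched
pattern = build_pattern(items)
for sent in ["The Duke is a movie.", "How is the movie The Duke of A?"]:
    print(join_items(sent, pattern))
```

Compiling the pattern once and reusing it for every sentence avoids rebuilding a very long alternation on each call.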
Upvotes: 1
Reputation: 99
Swap the positions of 'The Duke of A' and 'The Duke', so that the line:
for item in ["The Duke", "The Duke of A"]:
becomes:
for item in ["The Duke of A", "The Duke"]:
Upvotes: 0
Reputation: 195468
If you cannot change the position of the items in the list, you could try this version. In the first pass we collect all matches, and in the second pass we do the substitution:
data = '''The Duke
The Duke is a movie.
How is the movie The Duke?
The Duke of A
The Duke of A is a movie.
How is the movie The Duke of A?'''
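terms = ["The Duke", "The Duke of A"]
import re
to_change = []
# First pass: record the span of every match of every term.
for t in terms:
    for g in re.finditer(t, data):
        to_change.append((g.start(), g.end()))
# Second pass: replace whitespace inside each span. The length is unchanged, so the recorded offsets stay valid.
for (start, end) in to_change:
    data = data[:start] + re.sub(r'\s', r'_', data[start:end]) + data[end:]
print(data)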
Prints:
The_Duke
The_Duke is a movie.
How is the movie The_Duke?
The_Duke_of_A
The_Duke_of_A is a movie.
How is the movie The_Duke_of_A?
Upvotes: 0
Reputation: 31
Your loop first looks for "The Duke". When re finds a match, you replace it with "The_Duke". The second pass of the loop then looks for "The Duke of A", but re cannot find it, because the text now reads "The_Duke of A".
This should work.
for sent in sentences:
    for item in ["The Duke of A", "The Duke"]:
        find = re.search(r'{0}'.format(item), sent)
        if find:
            sent = sent.replace(sent[find.start():find.end()], item.replace(" ", "_"))
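To see the failure described above in isolation, here is a small sketch that replays the two passes on one sentence. It also shows one possible way to get the longest-first order without editing a large list by hand, assuming it is acceptable to sort a copy by length; `replace_items` is a hypothetical helper, not part of the original code.

```python
import re

sent = "The Duke of A is a movie."
# First pass: the shorter item matches inside the longer one.
after_first = sent.replace("The Duke", "The_Duke")
print(after_first)
# Second pass: the longer item is no longer present, so search returns None.
print(re.search("The Duke of A", after_first))

def replace_items(sent, items):
    # sorted() returns a new list; the caller's list is not modified.
    for item in sorted(items, key=len, reverse=True):
        sent = sent.replace(item, item.replace(" ", "_"))
    return sent

print(replace_items(sent, ["The Duke", "The Duke of A"]))
```

Because a longer item is always handled before any of its substrings, the shorter item can no longer break it apart.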
Upvotes: 0