Reputation: 654
I'm analyzing a text and I'd like to extract the smallest substring starting from the occurrence of a certain word until the end of the text. My particular problem is that that word can be in several parts of my text.
I've tried the following:
pattern = re.compile('(word)(.*?)$', re.DOTALL)
result = re.search(pattern, MY_TEXT).group()
My problem is that this doesn't return the smallest possible string, but the largest one found in the text (i.e., from the first occurrence of word until the end of the text, instead of from the last occurrence). I was sure that adding the ? character after .* inside the second parenthesis would have solved the problem, but it didn't.
Example input:
text = "Pokémon is a media franchise managed by The Pokémon Company, a Japanese consortium between Nintendo, Game Freak, and Creatures.\nThe franchise began as Pokémon Red and Green (later released outside of Japan as Pokémon Red and Blue)."
word = 'Pokémon'
I'd expect my result to be the string: Pokémon Red and Blue). Right now, however, I'm getting the whole text as a result.
How can I get what I expect? Thanks in advance.
Upvotes: 1
Views: 348
Reputation: 163277
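Your current pattern (Pokémon)(.*?)$ has 2 capturing groups, and it will always match from the first occurrence of the word. The regex engine tries start positions from left to right, and the lazy second group can still expand until the end of the string, so the first position where Pokémon is found already yields a complete match. The ? only makes .* prefer a shorter match from a given start position; it does not move the start of the match to the right.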
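A minimal sketch of that behaviour, using a shortened sample string that is made up for illustration (it is not the input from the question):

```python
import re

# Shortened sample text; an assumption for illustration only.
sample = "Pokémon Red and Green, later Pokémon Red and Blue"

# The engine tries start positions left to right and stops at the first
# position where the whole pattern can succeed.
match = re.search(r"(Pokémon)(.*?)$", sample, re.DOTALL)

start_of_match = match.start()              # index where the match begins
first_occurrence = sample.find("Pokémon")   # index of the first occurrence
last_occurrence = sample.rfind("Pokémon")   # index of the last occurrence

print(start_of_match == first_occurrence)   # True
print(start_of_match == last_occurrence)    # False
```

The lazy group still has to reach $, so from the first occurrence it grows until it covers the rest of the string.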
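To get to the last occurrence of the word, you could use .*Pokémon, as the greedy .* will first match until the end of the string and then backtrack until it can fit Pokémon. The rest of the string will then be matched by the following .* and the value you want is in the first capturing group. Because the text contains a newline, the pattern needs re.DOTALL so that the dot also matches it.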
^.*(Pokémon .*)$
To create a more dynamic pattern:
import re
text = "Pokémon is a media franchise managed by The Pokémon Company, a Japanese consortium between Nintendo, Game Freak, and Creatures.\nThe franchise began as Pokémon Red and Green (later released outside of Japan as Pokémon Red and Blue)."
word = "and"
pattern = r"^.*(" + re.escape(word) + r".*)$"
regex = re.compile(pattern, re.DOTALL)
result = re.search(regex, text).group(1)
print(result)
Result
and Blue).
If the word can also be the last word in the sentence, then instead of matching a literal space after it you could use a negative lookahead (?!\S) to assert that what is directly to the right is not a non-whitespace character.
^.*(Pokémon(?!\S).*)$
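As a rough sketch, the lookahead variant could be wrapped in a small helper. The name last_from_word and the sample strings are hypothetical, chosen only for illustration; the helper also returns None when the word is absent instead of raising on .group(1):

```python
import re

def last_from_word(text, word):
    """Hypothetical helper: substring from the last occurrence of `word`
    that is not followed by a non-whitespace character, or None."""
    pattern = r"^.*(" + re.escape(word) + r"(?!\S).*)$"
    match = re.search(pattern, text, re.DOTALL)
    return match.group(1) if match else None

# The word is the very last thing in the string.
print(last_from_word("I like Pokémon Red and Pokémon", "Pokémon"))
# Pokémon

# "Pokémons" is skipped because "s" directly follows the word.
print(last_from_word("a Pokémon card and Pokémons", "Pokémon"))
# Pokémon card and Pokémons

# No occurrence at all.
print(last_from_word("no match here", "Pokémon"))
# None
```

Note that (?!\S) also rejects occurrences followed directly by punctuation, such as "Pokémon," or "Pokémon."; whether that is wanted depends on the text.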
Upvotes: 2