Luise

Reputation: 654

python regex: how can I get the smallest substring from a certain word to the end of the text?

I'm analyzing a text and I'd like to extract the smallest substring that starts at an occurrence of a certain word and runs to the end of the text. My particular problem is that the word can occur several times in my text.

I've tried the following:

import re

pattern = re.compile('(word)(.*?)$', re.DOTALL)
result = re.search(pattern, MY_TEXT).group()

My problem is that this doesn't return the smallest possible string, but the largest one found in the text (i.e., from the first occurrence of word until the end of the text, instead of from the last occurrence). I was sure that adding the ? character after .* inside the second parenthesis would solve the problem, but it didn't.

Example input:

text = "Pokémon is a media franchise managed by The Pokémon Company, a Japanese consortium between Nintendo, Game Freak, and Creatures.\nThe franchise began as Pokémon Red and Green (later released outside of Japan as Pokémon Red and Blue)."
word = 'Pokémon'

I'd expect my result to be the string: Pokémon Red and Blue)., but right now I'm getting the whole text as a result.

How can I get what I expect? Thanks in advance.

Upvotes: 1

Views: 348

Answers (2)

The fourth bird

Reputation: 163277

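Your current pattern (Pokémon)(.*?)$ has two capturing groups, but it always matches from the first occurrence of the word. The regex engine tries start positions from left to right and returns the leftmost match. The lazy .*? only controls how little is consumed from that starting point, and the $ anchor forces it to extend to the end of the string anyway.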
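As a minimal sketch of that leftmost-match behaviour (the short sample string below is made up for illustration, it is not the text from the question), compare the lazy pattern with one that has a greedy prefix:

```python
import re

# Hypothetical sample text with two occurrences of the word.
text = "Pokémon Red, Pokémon Blue"

# Lazy quantifier anchored to the end: the match still starts at the
# FIRST "Pokémon", because start positions are tried left to right.
lazy = re.search(r"(Pokémon)(.*?)$", text, re.DOTALL)
print(lazy.group())     # Pokémon Red, Pokémon Blue

# A greedy prefix consumes as much as possible and then backtracks,
# so the capturing group starts at the LAST "Pokémon".
greedy = re.search(r"^.*(Pokémon.*)$", text, re.DOTALL)
print(greedy.group(1))  # Pokémon Blue
```

The lazy quantifier never gets a chance to pick a later start position, because a match is already possible from the first one.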

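To get to the last occurrence of the word, you could use .*Pokémon, as the greedy .* will first match until the end of the string and then backtrack until it can fit Pokémon.

Then the rest of the string will be matched by the following .* and the value you want is in the first capturing group. Note that the text contains a newline, so the pattern needs the re.DOTALL flag (as in your own code) for . to match it.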

^.*(Pokémon .*)$

To build the pattern dynamically for any word, escape the word with re.escape:

import re

text = "Pokémon is a media franchise managed by The Pokémon Company, a Japanese consortium between Nintendo, Game Freak, and Creatures.\nThe franchise began as Pokémon Red and Green (later released outside of Japan as Pokémon Red and Blue)."
word = "and"
# Escape the word so any regex metacharacters in it are matched literally
pattern = r"^.*(" + re.escape(word) + r".*)$"
# re.DOTALL lets . match the newline inside the text
regex = re.compile(pattern, re.DOTALL)
result = regex.search(text).group(1)
print(result)

Result

and Blue).
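If the word might not occur at all, re.search returns None and calling .group(1) on it raises an AttributeError. A small wrapper can guard against that. This is only a sketch: the function name tail_from_last and the sample string are made up for illustration.

```python
import re

def tail_from_last(word, text):
    """Return the substring from the last occurrence of `word` to the
    end of `text`, or None if `word` does not occur (hypothetical helper)."""
    # re.escape guards against regex metacharacters in `word`.
    pattern = r"^.*(" + re.escape(word) + r".*)$"
    match = re.search(pattern, text, re.DOTALL)
    return match.group(1) if match else None

sample = "red and green\nred and blue"
print(tail_from_last("and", sample))     # and blue
print(tail_from_last("yellow", sample))  # None
```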

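The first pattern, ^.*(Pokémon .*)$, requires a space after the word. If the word can also be the last word in the text, you could instead use a negative lookahead, (?!\S), to assert that what is on the right is not a non-whitespace character. This also matches at the end of the string, and it still prevents a partial match such as Pokémons.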

^.*(Pokémon(?!\S).*)$
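To illustrate the difference between requiring a literal space and using the lookahead, here is a small sketch. The two sample strings are made up for this comparison.

```python
import re

with_space = re.compile(r"^.*(Pokémon .*)$", re.DOTALL)
with_lookahead = re.compile(r"^.*(Pokémon(?!\S).*)$", re.DOTALL)

# The word is the very last thing in the text: there is no trailing space.
ends_with_word = "Red and Blue are Pokémon"
print(with_space.search(ends_with_word))               # None
print(with_lookahead.search(ends_with_word).group(1))  # Pokémon

# The last occurrence is part of a longer word, so the lookahead rejects
# it and the match falls back to the previous whole-word occurrence.
plural_last = "I like Pokémon and other Pokémons"
print(with_lookahead.search(plural_last).group(1))
# Pokémon and other Pokémons
```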

Upvotes: 2

Emma

Reputation: 27723

I'm guessing that you wish to extract everything from the last instance of Pokémon to the end of the input string. An expression such as

^.*(Pokémon.*)$

is likely to do that when compiled with re.DOTALL, with the result in the first capturing group.


Upvotes: 1
