IMAbev

Reputation: 188

Regex skipping to specific phrases

What is the best way to get regex to 'read' through characters and stop at specific phrases for a capture? A lot of the time I have used .*? and .+? to get through unwanted characters to a specific string or tag and then capture.

I want to read through any character until I get to a specific phrase or tag. I would typically do something like

date.*?<.*?>(\w+)<.*?>

from a string that looks like

datestuffstuffstuffstuff<tag>animal<tag>

That works in a simple example, but the engine gets lost when the text to match is 10K characters long. Do I need to be more specific when I get to the capture? In plain English, the regex should: skip characters until you get to this phrase, and then capture.

Upvotes: 1

Views: 45

Answers (1)

Wiktor Stribiżew

Reputation: 626870

Since you are asking how to parse plain text, I can suggest using negated character classes, i.e. [^CHARACTERS_THAT_SHOULD_NOT_BE_MATCHED].

Negated character classes are the most efficient regex subpatterns. Consider

word one#word 2#more text

The #(.*?)# will take 18 steps to find a match, and #([^#]*)# will do it in 6 steps.
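As a rough illustration in Python (the answer does not name a regex flavor, so Python's re module is an assumption here), both patterns capture the same text from the sample string; the difference is only in how much work the engine does. The last pattern applies the same idea to the example from the question, assuming that nothing before the first tag contains < and that the tags themselves never contain >.

```python
import re

sample = "word one#word 2#more text"

# Lazy dot: expands one character at a time, re-checking for '#' after each step.
lazy = re.search(r"#(.*?)#", sample)

# Negated character class: consumes everything that is not '#' in a single run.
negated = re.search(r"#([^#]*)#", sample)

print(lazy.group(1))     # both patterns capture the same substring
print(negated.group(1))

# The question's pattern rewritten with negated classes (a sketch, not the
# only possible rewrite): skip non-'<' text, skip the tag, capture the word.
question_text = "datestuffstuffstuffstuff<tag>animal<tag>"
animal = re.search(r"date[^<]*<[^>]*>(\w+)<[^>]*>", question_text)
print(animal.group(1))
```

The negated classes cannot run past their delimiter, so on a long input the engine has far fewer positions to retry than with .*? between every piece.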

Also, . does not match a newline by default; you need to enable DOTALL mode with (?s), /s, or other means, depending on the flavor.
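A small sketch of the newline behaviour, again assuming Python's re module; the multi-line sample string is made up for illustration. In Python, DOTALL can be switched on either with the re.DOTALL flag or with an inline (?s) at the start of the pattern.

```python
import re

# Hypothetical input where the text between "date" and the tag spans two lines.
multiline = "date stuff\nmore stuff<tag>animal<tag>"
pattern = r"date.*?<tag>(\w+)<tag>"

# By default '.' stops at the newline, so the search fails.
without_dotall = re.search(pattern, multiline)

# With DOTALL, '.' also matches '\n' and the search succeeds.
with_flag = re.search(pattern, multiline, re.DOTALL)

# The inline modifier (?s) has the same effect as the flag.
with_inline = re.search(r"(?s)" + pattern, multiline)

print(without_dotall)
print(with_flag.group(1))
print(with_inline.group(1))
```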

If you need to match some unwanted text between two or more required characters, then you will have to use either .* or .*? (with or without the DOTALL modifier) or, if you need the closest match, a tempered greedy token (especially if some substring must be excluded).
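To make the tempered greedy token concrete, here is a minimal Python sketch; the sample string and the <b> tags are invented for illustration. The token (?:(?!<b>).)* matches any character as long as it does not start another <b>, so the match cannot span a second opening tag and ends up at the closest pair.

```python
import re

# Hypothetical input with a stray opening tag before the real pair.
text = "<b>one<b>two</b>"

# Lazy dot: starts at the first <b> and runs through the second one.
lazy = re.search(r"<b>(.*?)</b>", text)

# Tempered greedy token: each '.' is guarded by a lookahead that refuses
# to step onto another "<b>", so only the innermost pair can match.
tempered = re.search(r"<b>((?:(?!<b>).)*)</b>", text)

print(lazy.group(1))
print(tempered.group(1))
```

The lookahead runs before every character, so this is slower than a negated character class; it is worth using only when the excluded delimiter is longer than one character.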

Upvotes: 1
