Reputation: 188
What is the best way to get regex to 'read' through characters and stop at specific phrases for a capture? A lot of the time I have used .*? and .+? to get through unwanted characters to a specific string or tag and then capture.
I want to read through any character until I get to a specific phrase or tag. I would typically do something like
date.*?<.*?>(\w+)<.*?>
from a string that looks like
datestuffstuffstuffstuff<tag>animal<tag>
That works in a simple example, but the engine gets lost when the text to match is 10K characters long. Do I need to be more specific when I get to the capture? In plain English, the regex should: skip characters until you get to this phrase, and then capture.
Upvotes: 1
Views: 45
Reputation: 626870
Since you are asking how to parse plain text, I can suggest using negated character classes, i.e. [^ + CHARACTERS_THAT_SHOULD_NOT_BE_MATCHED + ].
Negated character classes are the most efficient regex subpatterns. Consider
word one#word 2#more text
The pattern #(.*?)# will take 18 steps to find a match, and #([^#]*)# will do it in 6 steps.
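As a rough illustration in Python (my choice of language; the question does not name a flavor), both patterns return the same capture on the sample string, and the pattern from the question can be rewritten with negated classes in the same way. Python's re module does not report step counts, so this sketch only checks the results, not the efficiency claim. The variable names are made up for the example.

```python
import re

sample = "word one#word 2#more text"

# Lazy dot: grows the match one character at a time, testing for '#' after each.
lazy_match = re.search(r"#(.*?)#", sample)

# Negated class: takes every character that is not '#' in one go.
negated_match = re.search(r"#([^#]*)#", sample)

print(lazy_match.group(1))
print(negated_match.group(1))

# The pattern from the question, with each .*? replaced by a negated class
# that names the delimiter it has to stop at.
question_text = "datestuffstuffstuffstuff<tag>animal<tag>"
tag_match = re.search(r"date[^<]*<[^>]*>(\w+)<[^>]*>", question_text)

print(tag_match.group(1))
```

Here [^<]* can never run past the next <, and [^>]* can never run past the next >, so the engine has far fewer positions to retry on long input.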
Also, . does not match a newline by default; you need to enable DOTALL mode with (?s), /s, or other means, depending on the flavor.
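A small sketch of the newline behavior, assuming Python's re module, where DOTALL is available as the re.DOTALL flag or as the inline (?s) modifier. The sample text is invented for the example.

```python
import re

# Hypothetical input where the tag sits on a later line than "date".
text = "date line one\nline two <tag>animal<tag>"

pattern = r"date.*?<[^>]*>(\w+)"

# Without DOTALL the dot cannot cross the newline, so no match is found.
no_dotall = re.search(pattern, text)

# With the flag, or with the inline modifier, the dot also matches newlines.
with_flag = re.search(pattern, text, re.DOTALL)
with_inline = re.search(r"(?s)" + pattern, text)

# A negated class is not affected by DOTALL: [^<] matches newlines too.
with_negated = re.search(r"date[^<]*<[^>]*>(\w+)", text)

print(no_dotall)
print(with_flag.group(1))
print(with_inline.group(1))
print(with_negated.group(1))
```

The last pattern shows a side benefit of negated classes: they cross line breaks without any mode switch.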
If you need to match some unnecessary text between 2 or more required characters, then you will have to either use .* / .*? (with or without the dotall modifier), or, if you need the closest match, a tempered greedy token (especially if some substring must be excluded).
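To make the tempered greedy token concrete, here is a minimal Python sketch. The token has the form (?:(?!SUBSTRING).)*, which consumes one character at a time only while SUBSTRING does not start at the current position. The input string and the choice of "date" as the excluded substring are assumptions made for illustration.

```python
import re

# Hypothetical input with two "date" markers before the tag.
text = "date old stuff date new stuff<tag>animal<tag>"

# Lazy dot: the match starts at the FIRST "date" and runs across the second one.
lazy = re.search(r"date.*?<tag>(\w+)", text)

# Tempered token: (?!date) forbids stepping onto another "date", so the
# attempt from the first "date" fails and the match starts at the second.
tempered = re.search(r"date(?:(?!date).)*?<tag>(\w+)", text)

second_date = text.index("date", 1)

print(lazy.start(), lazy.group(1))
print(tempered.start(), tempered.group(1))
print(tempered.start() == second_date)
```

Both patterns capture the same word here, but only the tempered version anchors the match to the "date" closest to the tag. The lookahead runs at every position, so it costs more than a plain negated class and is best kept for cases where a multi-character substring has to be excluded.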
Upvotes: 1