Reputation: 512
I've read many questions on Stack Overflow, including this one and this one, and even Rexegg's Best Trick, which is also covered in a question here. I found this one, which works on entire lines, but not on "everything up to the bad word". None of these have helped me, so here I go:
In JavaScript, I have a long regex pattern. I'm trying to match a sequence in similar sentence structures, as follows:
UniquePrefixA [some-token] and [some-token] want to take [some-token] to see some monkeys.
UniqueC [some-token] wants to take [some-token] to the store. UniqueB, [some-token] is in the pattern once more.
UniquePrefixA [some-token] is using [some-token] to [some-token].
Notice that each pattern starts with a unique prefix. Encountering that prefix signals the start of a pattern. If I encounter a prefix again during capture, I should not capture the second occurrence, and STOP THERE. I'll have captured everything up to that prefix.
If I don't encounter the prefix later in the pattern, I need to continue matching that pattern.
I'm also using capture groups (not repeated ones, since a repeated capture group only returns its last match). The capture group contents need to be returned, so I'm using match, non-greedy.
Here's my pattern and a working example:
/(?:UniquePrefixA|UniqueB|UniqueC)\s*(\[some-token\])(?:and|\s)*(\[some-token\])?(\s|[^\[\]])*(\[some-token\])? --->(\s|[^\[\]])*<--- (\[some-token\])?(\s|[^\[\]])*/i
It's basically 2 repeating patterns in a specific order:
(\s|[^\[\]])* // Basically .*, but excluding brackets
(\[some-token\]) // A token [some-token]
How can I prevent the match from continuing past a blacklist of words?
I want this to happen where I drew the arrows, for context. The equivalent of "any character, but not the contents of this list": (UniquePrefixA|UniqueB|UniqueC) (as seen in the first group of the pattern).
It's possible I need a better understanding of negative lookahead, or of whether it can work with a group of things. Most importantly, I'd like to know whether a negative lookahead approach can support a list of options. Or is there a better way altogether? If the answer is "you can't do that," that's cool too.
Upvotes: 0
Views: 493
Reputation: 19335
To ensure that a pattern does not occur within a repeated character sequence such as (\s|[^\[\]])* (note that \s is already included in [^\[\]], so this may be just [^\[\]]*), prepend a negative lookahead (a zero-length assertion, like ^) on the left, inside the repeated group, so that it is checked before every character:
((?!UniquePrefixA)(\s|[^\[\]]))*
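The lookahead can hold an alternation, so the same idea should extend to the whole list of prefixes. Below is a minimal sketch, assuming a simplified version of the question's pattern (up to three tokens, with the `(?:and|\s)*` part folded into the filler). The names `PREFIXES`, `filler`, `token`, and `pattern` are made up for illustration.

```javascript
// Hypothetical names, for illustration only.
const PREFIXES = "UniquePrefixA|UniqueB|UniqueC";

// Any run of non-bracket characters, where each character is consumed
// only if none of the prefixes starts at that position.
const filler = `(?:(?!${PREFIXES})[^\\[\\]])*`;
const token = "(\\[some-token\\])";

// Simplified stand-in for the question's pattern: one required token,
// then up to two more optional "filler + token" pairs, then trailing filler.
const pattern = new RegExp(
  `(?:${PREFIXES})${filler}${token}` +
    `(?:${filler}${token})?(?:${filler}${token})?${filler}`,
  "i"
);

const text =
  "UniqueC [some-token] wants to take [some-token] to the store. " +
  "UniqueB, [some-token] is in the pattern once more.";

const m = text.match(pattern);
console.log(JSON.stringify(m[0])); // the match stops right before "UniqueB"
console.log(m.slice(1));           // captured tokens; unmatched groups are undefined
```

Because the lookahead sits inside the repeated group, it is re-checked before every character, which is what stops the filler at the next prefix instead of at the next bracket.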
Upvotes: 0
Reputation: 31011
I think an easier-to-maintain solution is to divide your task into 2 parts:
Find each chunk of text starting from any of your unique prefixes, up to the next one or to the end of the string.
Process each such chunk, looking for your [some-token]s and maybe also the content between them.
The regex performing the first task should include 3 parts:
(?:UniquePrefixA|UniqueB|UniqueC) - A non-capturing group looking for any unique prefix.
((?:.|\n)+?) - A capturing group: the fragment to catch for further processing (see the note below).
(?=UniquePrefixA|UniqueB|UniqueC|$) - A positive lookahead, looking for either any unique prefix or the end of the string (the stop criterion you are looking for).
To sum up, the whole regex looks like below:
/(?:UniquePrefixA|UniqueB|UniqueC)((?:.|\n)+?)(?=UniquePrefixA|UniqueB|UniqueC|$)/gi
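As a rough sketch of the first task, the regex above can be run with `matchAll` to collect group 1 of every match. The sample text is the three lines from the question joined with newlines, and the variable names (`chunkRe`, `text`, `chunks`) are only illustrative.

```javascript
// The chunking regex from this answer, unchanged.
const chunkRe =
  /(?:UniquePrefixA|UniqueB|UniqueC)((?:.|\n)+?)(?=UniquePrefixA|UniqueB|UniqueC|$)/gi;

// Sample input: the question's three lines, joined with newlines.
const text = [
  "UniquePrefixA [some-token] and [some-token] want to take [some-token] to see some monkeys.",
  "UniqueC [some-token] wants to take [some-token] to the store. UniqueB, [some-token] is in the pattern once more.",
  "UniquePrefixA [some-token] is using [some-token] to [some-token].",
].join("\n");

// Group 1 of each match is the chunk that follows a prefix.
const chunks = Array.from(text.matchAll(chunkRe), (m) => m[1]);

for (const chunk of chunks) {
  console.log(JSON.stringify(chunk));
}
```

Note that the second line yields two chunks, because UniqueB appears in the middle of it.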
Note: JavaScript regex did not originally implement the single-line (s) option; the s (dotAll) flag was only added in ES2018. So, where that flag is unavailable, instead of just . in the capturing group above, you must use (?:.|\n), meaning:
either any char other than a line break (.),
or \n.
Both these variants are "enveloped" in a non-capturing group to delimit the alternatives (both sides of |), because the repetition marker (+?) applies to both variants.
Note the ? after +, which makes it the reluctant (lazy) version.
So this part of the regex (the capturing group) will match any sequence of chars including \n, ending before the next unique prefix (if any), just as you expect.
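For comparison, here is a small sketch of three ways to spell "any char including \n" in the capturing group: the (?:.|\n) alternation used above, the common [\s\S] character class, and a plain . with the s (dotAll) flag, which needs an ES2018-capable engine. The sample text and the names `stop`, `variants`, and `results` are assumptions for the demo.

```javascript
// Made-up sample with a line break inside the chunk.
const text = "UniqueB first line\nsecond line UniqueC tail";

// The stop criterion from the answer, shared by all three variants.
const stop = "(?=UniquePrefixA|UniqueB|UniqueC|$)";

const variants = {
  alternation: new RegExp(`UniqueB((?:.|\\n)+?)${stop}`), // as in the answer
  charClass: new RegExp(`UniqueB([\\s\\S]+?)${stop}`),    // whitespace or non-whitespace
  dotAll: new RegExp(`UniqueB(.+?)${stop}`, "s"),         // requires ES2018 "s" flag
};

// Capture group 1 for each variant.
const results = Object.values(variants).map((re) => text.match(re)[1]);
console.log(results.map((r) => JSON.stringify(r)));
```

On this input all three capture the same chunk. They can differ on other line terminators such as \r, which the alternation form does not cover.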
The second task is to apply another regex to the captured chunk (group 1), looking for [some-token]s and possibly the content between them.
You didn't specify exactly what you want to do with each chunk, so I'm not sure what this second regex should include.
Maybe it will be enough just to match [some-token]?
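If plain token matching is enough, the second task might look something like the sketch below. It assumes one chunk has already been captured (here hard-coded from the question's first sample line), and the names `chunk`, `tokenRe`, `tokens`, and `between` are hypothetical.

```javascript
// One chunk, as it would come out of group 1 in the first step.
const chunk =
  " [some-token] and [some-token] want to take [some-token] to see some monkeys.";

// Every token occurrence, with its position inside the chunk.
const tokenRe = /\[some-token\]/g;
const tokens = Array.from(chunk.matchAll(tokenRe), (m) => ({
  text: m[0],
  index: m.index,
}));

// The content between tokens: splitting on the token leaves the gaps.
const between = chunk.split(/\[some-token\]/);

console.log(tokens.length);
console.log(between);
```

Since `matchAll` returns every occurrence, this sidesteps the limitation that a repeated capture group only keeps its last match.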
Upvotes: 1