Reputation:
So I understand that [^A-Za-z] would match any character that's not a letter.
Is there any way to do this with a group? For example, (?^:&amp;) would match any sequence of characters that is not the sequence &amp;.
NOTE: as Mark Reed pointed out, it would be pointless to match an empty string, since an empty string is itself a sequence of characters that is not the sequence &amp;. I would therefore like the regex to match as many characters as possible.
FOR EXAMPLE:
in Ben &amp; Jerry's the matches would be "Ben " and " Jerry's" (note that the whitespace after Ben and before Jerry's is captured too).
NOTE: if possible, please do not use lookbehinds, because I will be using the regex in a JS script, and JavaScript does not support lookbehinds.
Upvotes: 5
Views: 8052
Reputation: 139471
See Randal’s Rule.
Randal's Rule
Randal Schwartz (author of Learning Perl) says:
Use capturing when you know what you want to keep.
Use split when you know what you want to throw away.
var s = "Ben &amp; Jerry's";
var a = s.split(/&amp;/); // throw away the separator, keep everything around it
document.body.innerHTML = "<pre>[" + a.join("][") + "]</pre>";
To show how much work the negative look-ahead (?!...) saves us, the equivalent regex to match a string that does not contain the sequence &amp; is
^([^&]|&+[^&a]|(&+a)+([^&m]|&+[^&a])|(&+a)+m((&+a)+m)*([^&p]|&+[^&a]|(&+a)+([^&m]|&+[^&a]))|(&+a)+m((&+a)+m)*p((&+a)+m((&+a)+m)*p)*([^&;]|&+[^&a]|(&+a)+([^&m]|&+[^&a])|(&+a)+m((&+a)+m)*([^&p]|&+[^&a]|(&+a)+([^&m]|&+[^&a]))))*(&+|(&+a)+(&+)?|(&+a)+m((&+a)+m)*(&+|(&+a)+(&+)?)?|(&+a)+m((&+a)+m)*p((&+a)+m((&+a)+m)*p)*(&+|(&+a)+(&+)?|(&+a)+m((&+a)+m)*(&+|(&+a)+(&+)?)?)?)?$
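For comparison, here is a minimal sketch of the same check written with a negative look-ahead. The names `NO_AMP_ENTITY` and `lacksAmpEntity` are hypothetical, and the cross-check against `String.prototype.includes` is only there to illustrate what the pattern tests.

```javascript
// Hypothetical helper: true when `text` contains no "&amp;" sequence.
// (?!&amp;) refuses to step over a position where "&amp;" starts;
// [\s\S] then consumes one character of any kind, newlines included.
const NO_AMP_ENTITY = /^(?:(?!&amp;)[\s\S])*$/;

function lacksAmpEntity(text) {
  return NO_AMP_ENTITY.test(text);
}

const samples = ["Ben & Jerry's", "Ben &amp; Jerry's", "&am p;", ""];
for (const sample of samples) {
  // The two boolean columns should agree for every sample.
  console.log(JSON.stringify(sample), lacksAmpEntity(sample), !sample.includes("&amp;"));
}
```

The regex has no g flag, so calling `test` repeatedly does not carry any `lastIndex` state between calls.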
Upvotes: 0
Reputation: 626826
What you need is a regex that matches alternatives and captures into Group 1 only the last alternative, which is a tempered greedy token (or an unrolled version of it for better performance, if you only have 2 or 3):
&amp;|((?:(?!&amp;)[\s\S])+)
See the regex demo (an unrolled version: &amp;|([^&]*(?:&(?!amp;)[^&]*)*)).
The pattern:
&amp; - matches the &amp; entity
| - or
((?:(?!&amp;)[\s\S])+) - matches and captures into group 1 any chunk of text (1+ characters) that is not a starting point for a &amp; sequence.
Since it is for JS, you need [\s\S] (or [^]) to match any character including a newline. Otherwise, use . instead (if you only intend to match within lines).
var re = /&amp;|((?:(?!&amp;)[\s\S])+)/g;
var str = 'abc Ben &amp; Jerry\'s foobar ssss sss sss &\n\n\nsssss&sssss &\n\nsssss&sssss &sssss\n&sssss&\n&&';
var res = [];
var m;
while ((m = re.exec(str)) !== null) {
    if (m.index === re.lastIndex) { // Only necessary for the unrolled
        re.lastIndex++;             // pattern (as it can match an empty string)
    }
    res.push(m[1]); // Only collect the captured texts
}
document.body.innerHTML = "<pre>BEFORE:<br/>" + str.replace(/&/g, '&amp;') + "</pre>";
document.body.innerHTML += "<pre>AFTER:<br/>" + res.join("") + "</pre>";
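The same idea can be sketched without the DOM, for example to run under Node. The function name `chunksOutsideAmp` is hypothetical. It uses the unrolled pattern and assumes an engine that provides `String.prototype.matchAll`.

```javascript
// Hypothetical helper: returns the chunks of `text` that lie between "&amp;" entities.
function chunksOutsideAmp(text) {
  // Unrolled form: either the entity itself, or (captured in group 1) a run of
  // non-& characters plus any "&" that does not start "&amp;".
  const re = /&amp;|([^&]*(?:&(?!amp;)[^&]*)*)/g;
  const chunks = [];
  // matchAll steps past empty matches itself, so no manual lastIndex bump is needed.
  for (const m of text.matchAll(re)) {
    // Group 1 is undefined when the entity alternative matched,
    // and "" when the unrolled alternative matched nothing.
    if (m[1]) chunks.push(m[1]);
  }
  return chunks;
}

console.log(chunksOutsideAmp("Ben &amp; Jerry's"));
```

Returning an array instead of joining the pieces keeps the surrounding whitespace visible in each chunk, which is what the question asks for.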
Upvotes: 4
Reputation: 19156
Easy:
(.*?)(?:&amp;)|((?!&amp;).*)$
(.*?) : take everything, but non-greedily.
(?:&amp;) : ?: marks a non-capturing group, that is, a group whose value you don't want to get.
((?!&amp;).*)$ : get the rest of the string, which is not &amp;.
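A minimal sketch of how this pattern could be applied to the question's example. The helper name `partsAround` is hypothetical. Group 1 holds the text before each &amp; and group 2 holds the tail; a global scan also produces a final empty match at the end of the string, which the sketch filters out.

```javascript
// Hypothetical helper: collects the text before each "&amp;" and the remaining tail.
function partsAround(text) {
  const re = /(.*?)(?:&amp;)|((?!&amp;).*)$/g;
  const parts = [];
  for (const m of text.matchAll(re)) {
    // Exactly one of the two groups participates in each match.
    const part = m[1] !== undefined ? m[1] : m[2];
    // Skip empty pieces, such as the zero-length match at the end of the input.
    if (part) parts.push(part);
  }
  return parts;
}

console.log(partsAround("Ben &amp; Jerry's"));
```

Because . does not match a newline, this sketch is only meant for single-line input.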
Upvotes: 3