Reputation: 15988
What does the plus symbol in regex mean?
Upvotes: 51
Views: 71223
Reputation: 626747
A lot depends on where +
symbol appears and what the regex flavor is.
In posix-bre and vim (outside very magic mode) flavors, +
matches a literal +
char. E.g. sed 's/+//g' file > newfile
removes all +
chars in file
. If you want to use +
as a quantifier here, use \+
(supported in GNU tools), or replace it with \{1,\}
or write the quantified pattern twice, leaving the first copy unquantified and adding *
(zero or more occurrences quantifier) after the other (e.g. sed 's/c++*//'
removes c
followed with one or more +
chars).
In posix-ere and other regex flavors, outside a character class ([...]
), +
acts as a quantifier meaning "one or more, but as many as possible, occurrences of the quantified pattern". E.g. in javascript, s.replace(/\++/g, '-')
will replace a string like ++++
with a single -
. Note that in NFA regex flavors +
has a lazy counterpart, +?
, that matches "one or more, but as few as possible, occurrences of the quantified pattern".
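To make the greedy/lazy difference concrete, here is a minimal sketch in Python's re module (an NFA-style engine). The sample string and variable names are made up for illustration and are not from the answer above.

```python
import re

text = "a++++b"

# Greedy: \++ takes the whole run of plus signs in a single match.
greedy = re.findall(r"\++", text)

# Lazy: \++? stops as soon as one plus sign has been matched,
# so the same run is reported as four separate matches.
lazy = re.findall(r"\++?", text)

# Same idea as the JavaScript replace above: a run of pluses becomes one dash.
collapsed = re.sub(r"\++", "-", "++++")

print(greedy, lazy, collapsed)
```

The lazy form only behaves differently when something after it (or the end of the pattern) lets the match finish early, which is the case here.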
Inside a character class, the +
char is treated as a literal char, in every regex flavor. [+]
always matches a single +
literal char. E.g. in c#, Regex.Replace("1+2=3", @"[+]", "-")
will result in 1-2=3
. Note that it is not a good idea to put a single char inside a character class; only use a character class for two or more chars, or for charsets. E.g. [+0-9]
matches a +
or any ASCII digit char. In php, preg_replace('~[\s+]+~', '-', '1 2+++3')
will result in 1-2-3
since the regex matches one or more (due to last +
that is a quantifier) whitespaces (\s
) or plus chars (+
inside the character class).
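The two roles of + in and around a character class can be checked quickly in Python as well. This is just a sketch that ports the C# and PHP calls above to re.sub; the variable names are my own.

```python
import re

# Inside the brackets, + is a literal plus sign.
single = re.sub(r"[+]", "-", "1+2=3")

# The + inside [\s+] is literal; the + after the closing bracket is the
# quantifier, so each run of whitespace and/or plus chars becomes one dash.
runs = re.sub(r"[\s+]+", "-", "1 2+++3")

print(single, runs)
```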
The +
symbol can also be part of a possessive quantifier in some PCRE-like regex flavors (php, ruby, java, boost, icu, etc., but not .net, javascript, or python's re
 before Python 3.11). E.g. C\+++(?!\d)
in php PCRE would match C
and then one or more +
symbols (\+
- a literal +
and ++
one or more occurrences, without allowing backtracking into this quantified pattern) not followed by a digit. If there is a digit after the plus chars, the whole match fails. Other examples: a?+
(one or zero a
chars), a{1,3}+
(one to three a
chars as many as possible), a{3}+
(=a{3}
, three a
s), a*+
matches zero or more a
chars.
Upvotes: 4
Reputation: 94143
+
can actually have two meanings, depending on context.
Like the other answers mentioned, +
usually is a repetition operator, and causes the preceding token to repeat one or more times. a+
would be expressed as aa*
in formal language theory, and could also be expressed as a{1,}
(match a minimum of 1 times and a maximum of infinite times).
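If you want to convince yourself that the three spellings are interchangeable, a small check like the following works. It is only a sketch: the sample strings and the accepted helper are hypothetical names picked for this example.

```python
import re

patterns = [r"a+", r"aa*", r"a{1,}"]
samples = ["", "a", "aa", "aaaaaa", "b", "ab"]

def accepted(pattern):
    """Return the samples that the pattern matches in full."""
    return [s for s in samples if re.fullmatch(pattern, s)]

results = {p: accepted(p) for p in patterns}

# All three patterns accept the same strings and reject the empty string.
for pattern, matched in results.items():
    print(pattern, matched)
```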
However, +
can also make other quantifiers possessive if it follows a repetition operator (i.e. ?+
, *+
, ++
or {m,n}+
). A possessive quantifier is an advanced feature of some regex flavours (PCRE, Java and the JGsoft engine) which tells the engine not to backtrack once a match has been made.
To understand how this works, we need to understand two concepts of regex engines: greediness and backtracking. Greediness means that in general regexes will try to consume as many characters as they can. Let's say our pattern is .*
(the dot is a special construct in regexes which means any character [1]; the star means match zero or more times), and our target is aaaaaaaab
. The entire string will be consumed, because the entire string is the longest match that satisfies the pattern.
However, let's say we change the pattern to .*b
. Now, when the regex engine tries to match against aaaaaaaab
, the .*
will again consume the entire string. However, since the engine will have reached the end of the string and the pattern is not yet satisfied (the .*
consumed everything but the pattern still has to match b
afterwards), it will backtrack, one character at a time, and try to match b
. The first backtrack will make the .*
consume aaaaaaaa
, and then b
can consume b
, and the pattern succeeds.
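You cannot watch the backtracking itself from Python, but you can see its result by putting the .* in a capture group. A minimal sketch, with the group added purely for inspection:

```python
import re

target = "aaaaaaaab"

# .* alone is greedy and takes the entire string.
whole = re.match(r".*", target)

# With a trailing b, .* first grabs everything, then gives back one
# character so that b can match. The group shows what .* ended up with.
with_b = re.match(r"(.*)b", target)

print(whole.group(0))   # what .* consumed on its own
print(with_b.group(1))  # what .* kept after giving back the final b
```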
Possessive quantifiers are also greedy, but as mentioned, once they return a match, the engine can no longer backtrack past that point. So if we change our pattern to .*+b
(match any character zero or more times, possessively, followed by a b
), and try to match aaaaaaaab
, again the .*
will consume the whole string, but then since it is possessive, backtracking information is discarded, and the b cannot be matched so the pattern fails.
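Here is a rough way to reproduce that failure in Python. Two assumptions to flag: the native .*+ syntax is only accepted by Python's re from version 3.11 on, so it is guarded by a version check, and the lookahead-plus-backreference pattern is a common emulation of possessive matching that I am swapping in, not something from this answer.

```python
import re
import sys

target = "aaaaaaaab"

# Ordinary greedy quantifier: backtracking lets the trailing b match.
backtracking = re.search(r".*b", target)

# Emulation: a lookahead is atomic, so whatever it captured into group 1
# cannot be given back. \1 then consumes that text, leaving nothing for b.
emulated = re.search(r"(?=(.*))\1b", target)

# Native possessive quantifier, only compiled on Python 3.11 or newer.
# On older versions this stays None without ever compiling the pattern.
native = re.search(r".*+b", target) if sys.version_info >= (3, 11) else None

print(backtracking is not None, emulated, native)
```

Both possessive variants return None for this target, while the plain greedy pattern matches the whole string.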
[1] In most engines, the dot will not match a newline character, unless the /s
("singleline" or "dotall") modifier is specified.
Upvotes: 81
Reputation: 435
One or more of the previous expression.
[0-9]+
Would match:
1234567890
In:
I have 1234567890 dollars
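The same example as a quick Python check (a sketch; the variable names are mine):

```python
import re

sentence = "I have 1234567890 dollars"

# [0-9]+ matches the first run of one or more ASCII digits.
found = re.search(r"[0-9]+", sentence)

print(found.group(0))
```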
Upvotes: 11
Reputation: 370112
In most implementations +
means "one or more".
In some theoretical writings +
is used to mean "or" (most implementations use the |
symbol for that).
Upvotes: 21
Reputation: 35983
One or more occurrences of the preceding symbol.
E.g. a+
means the letter a
one or more times. Thus, a+
matches a
, aa
, aaaaaa
but not an empty string.
If you know what the asterisk (*
) means, then you can express (exp)+
as (exp)(exp)*
, where (exp)
is any regular expression.
Upvotes: 6