Reputation: 19562
I have a rather basic question about regexes.
I use the expression `.*` without thinking about it, expecting it to match e.g. up to the end of the line. This works.
But for some reason I started thinking about this expression. Checking Wikipedia (my emphasis):
. Matches any single character
* Matches the **preceding** element zero or more times
So, according to this definition, why doesn't `.*` try to match the first character in the string zero or more times, instead of applying the match to each character in the string?
I mean, if I have `abc`, it should try to match `a`, `aa`, `aaa`, etc., right?
But it does not:
$ perl -e '
> my $var="abcdefg";
> $var =~ /(.*)/;
> print "$1\n";'
abcdefg
Upvotes: 4
Views: 529
Reputation: 21
The `.` in a regular expression doesn't have a memory. Once it has matched the "a" in "abc", it forgets about it when trying to match the "b".
Upvotes: 2
Reputation: 272487
This:
.{2,4}
is really shorthand for this:
(..)|(...)|(....)
In the same way, this:
.*
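is really shorthand for this:
()|(.)|(..)|(...)| // etc.
(This describes which strings can match, not the order in which they are tried: a greedy quantifier tries the longest repetition first.)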
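As a rough check of the shorthand idea, here is a minimal sketch in Python (rather than Perl; these patterns behave the same way in both engines). The names `bounded`, `spelled_out` and `same_language` are made up for this illustration.

```python
import re

# Two spellings of "two to four arbitrary characters".
bounded = re.compile(r".{2,4}")
spelled_out = re.compile(r"..|...|....")

def same_language(max_len=6):
    """Return True if both patterns accept exactly the same whole strings."""
    for n in range(max_len + 1):
        s = "x" * n
        if bool(bounded.fullmatch(s)) != bool(spelled_out.fullmatch(s)):
            return False
    return True

print(same_language())

# The two forms differ only in which alternative is tried first:
# the greedy quantifier takes as much as it can, while the alternation
# settles for its leftmost branch that works.
print(bounded.match("abcdefg").group(0))
print(spelled_out.match("abcdefg").group(0))
```

So the expansion is accurate as a description of what can match; it only says nothing about preference when several lengths are possible.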
Upvotes: 3
Reputation: 6221
The confusion starts with the word "element" in "Matches the **preceding** element zero or more times". The term "preceding element" here refers to the preceding pattern rather than to the preceding capture (or preceding match).
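The difference between repeating a pattern and repeating a capture can be sketched like this, in Python rather than Perl (the syntax used is the same in both). The backreference pattern `(.)\1*` is not from the question; it is only my guess at what the asker expected `.*` to do.

```python
import re

text = "aaabc"

# `.*` repeats the pattern "any character", so every repetition
# is free to match a different character.
repeat_pattern = re.match(r".*", text).group(0)

# `(.)\1*` captures one character and then repeats that same
# captured character via the backreference \1.
repeat_capture = re.match(r"(.)\1*", text).group(0)

print(repeat_pattern)  # aaabc
print(repeat_capture)  # aaa
```

To get the "a, aa, aaa" behaviour you have to ask for it explicitly with a capture group and a backreference.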
Upvotes: 2
Reputation: 8420
The dot `.` matches any single character.
The `*` then matches the preceding element (the dot, in our case) zero or more times.
In "the preceding element zero or more times", "element" means the `.` itself and not the character it previously matched. It has nothing to do with previous matches. It only repeats the dot zero or more times.
It's like writing `.?.?.?` and so on, an unlimited number of times.
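A small sketch of that unrolling, in Python rather than Perl, assuming the string from the question. The variable names are illustrative only.

```python
import re

text = "abcdefg"

# One optional dot per character of the input: ".?.?.?.?.?.?.?"
optional_dots = ".?" * len(text)

star = re.fullmatch(r".*", text)
unrolled = re.fullmatch(optional_dots, text)
print(star.group(0) == unrolled.group(0))

# The finite version runs out of dots on a longer string,
# whereas `.*` keeps repeating for as long as it needs to.
longer = text + "h"
print(re.fullmatch(optional_dots, longer) is None)
print(re.fullmatch(r".*", longer) is not None)
```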
Upvotes: 1
Reputation: 8815
The `.` means any single character, as per the paste from Wikipedia. That doesn't mean the first character only, but really, as it says there, any character; that is, any type of character (as opposed to, say, only digits or only whitespace characters). So you are saying "match zero or more occurrences of any type of character at all", which of course matches your whole line.
Upvotes: 1
Reputation: 2241
`*` applies to the preceding element of the regular expression zero or more times; notice that the page you link refers to a "pattern element". Therefore, when attempting a match at the start of the string, it matches any single character; then it matches any single character again, and so on.
Similarly, if you say `(A|B)*`, it doesn't pick one of `A` or `B` and then match it repeatedly; it picks one of `A` or `B`, then "starts over".
Upvotes: 1