Jim

Reputation: 19562

Confused on a basic operation of regular expressions

I have a rather basic question about regexes.
I use the expression .* without thinking about it, expecting it to match e.g. up to the end of the line. This works.
But for some reason I started thinking about this expression. Checking Wikipedia (my emphasis):

.  Matches any single character  
*  Matches the **preceding** element zero or more times  

So now according to this definition, why doesn't .* try to match the first character in the string 0 or more times but instead tries to apply the match to each character in the string?
I mean, if I have abc, it should try to match a, aa, aaa, etc., right?
But it does not:

$ perl -e '
> my $var="abcdefg";
> $var =~ /(.*)/;
> print "$1\n";'
abcdefg

Upvotes: 4

Views: 529

Answers (6)

jorgeluis123

Reputation: 21

The . regular expression doesn't have a memory. Once it has matched the "a" in "abc", it forgets about it when trying to match the "b".

Upvotes: 2

Oliver Charlesworth

Reputation: 272487

This:

.{2,4}

is really shorthand for this:

(..)|(...)|(....)

In the same way, this:

.*

is really shorthand for this:

()|(.)|(..)|(...)| // etc.
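A quick way to check this equivalence is the sketch below. It uses Python's `re` module rather than Perl (my choice for illustration; the patterns behave the same way in both), and the helper name `same_language` is made up for this example. It compares the quantifier with the spelled-out alternation on test strings of length 0 to 5:

```python
import re

QUANTIFIED = re.compile(r".{2,4}")
SPELLED_OUT = re.compile(r"(..)|(...)|(....)")


def same_language(max_len=5):
    """Return True if both patterns accept exactly the same test strings."""
    for n in range(max_len + 1):
        s = "abcde"[:n]
        # fullmatch requires the whole string to be consumed
        a = QUANTIFIED.fullmatch(s) is not None
        b = SPELLED_OUT.fullmatch(s) is not None
        if a != b:
            return False
    return True


# Lengths of the test strings that the quantified form accepts
accepted_lengths = [n for n in range(6) if QUANTIFIED.fullmatch("abcde"[:n])]

print(same_language())   # True
print(accepted_lengths)  # [2, 3, 4]
```

The "shorthand" is about which strings are accepted. The two forms still differ in which capture groups get set, and the quantifier is greedy while the alternation as written tries the shortest alternative first, so an unanchored match on "abcd" returns "abcd" for one and "ab" for the other.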

Upvotes: 3

Kuba Wyrostek

Reputation: 6221

The confusion starts with the word "element" in "Matches the **preceding** element zero or more times". The term "preceding element" here refers to the "preceding pattern" rather than the "preceding capture" (or "preceding match").
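To make the distinction concrete, here is a small sketch in Python's `re` module (used here instead of Perl purely for illustration; the sample string "aaabcd" is made up). Repeating the previous match, which is what the question expected, needs a capture group and a backreference:

```python
import re

text = "aaabcd"

# `.*` repeats the pattern `.`, so each repetition may match a different character.
repeat_pattern = re.match(r".*", text).group()

# `(.)\1*` captures one character, then repeats that captured text via a backreference.
repeat_match = re.match(r"(.)\1*", text).group()

print(repeat_pattern)  # aaabcd
print(repeat_match)    # aaa
```

So the "a, aa, aaa" behavior does exist, but it has to be asked for explicitly with `\1`.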

Upvotes: 2

Hugo Dozois

Reputation: 8420

The dot character . matches any single character.

Now the character * matches the preceding element (which is "any single character" in our case) 0 or more times.

By:

the preceding element zero or more times

element means . and not the preceding character match. It has nothing to do with previous matches. It only repeats the dot 0 or more times.

It's like writing .?.? an infinite number of times.
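As a rough check of that analogy, the sketch below (Python's `re`, chosen for illustration rather than Perl) writes out a finite number of optional dots and compares the result with `.*`. An infinite repetition can't be written down, so this only approximates it for a string of known length:

```python
import re

text = "abcdefg"

# Seven optional dots written out by hand: `.?.?.?.?.?.?.?`
optional_dots = r".?" * len(text)

star_match = re.match(r".*", text).group()
optional_match = re.match(optional_dots, text).group()

# With only three optional dots, at most three characters can be matched.
short_match = re.match(r".?" * 3, text).group()

print(star_match)      # abcdefg
print(optional_match)  # abcdefg
print(short_match)     # abc
```

Each `.?` is an independent "any character, or nothing", which is why no repetition cares what the earlier ones matched.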

Upvotes: 1

Martin Dinov

Reputation: 8815

The . means any single character, as per the paste from Wikipedia. That doesn't mean the first character only but, as it says there, any character: that is, a character of any kind (as opposed to, say, only digits or only whitespace). So you are saying "match 0 or more occurrences of any kind of character at all", which of course matches your whole line. (In Perl, . does not match a newline unless the /s modifier is used, which is why the match stops at the end of the line.)

Upvotes: 1

Nicholas W

Reputation: 2241

* applies to the preceding element of the regular expression zero or more times - notice the page you link refers to a "pattern element". Therefore when attempting a match at the start of the string, it matches any single character; then it matches any single character, etc.

Similarly if you say (A|B)*, it doesn't pick one of A or B then match it repeatedly; it picks one of A or B then "starts over".
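A short sketch of the (A|B)* case, using Python's `re` instead of Perl for illustration (the test strings are made up). The second pattern, with a backreference, shows what "pick one and repeat it" would look like by contrast:

```python
import re

# Each iteration of the star chooses between A and B afresh.
starts_over = re.compile(r"(A|B)*")

# Pick A or B once, then repeat exactly what was captured.
pick_once = re.compile(r"(A|B)\1*")

mixed_ok = starts_over.fullmatch("ABBA") is not None
mixed_rejected = pick_once.fullmatch("ABBA") is None
uniform_ok = pick_once.fullmatch("AAAA") is not None

print(mixed_ok)        # True
print(mixed_rejected)  # True
print(uniform_ok)      # True
```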

Upvotes: 1
