Reputation: 2869
I'm trying to extract LogicalID
and SupplyChain
from
<LogicalID>SupplyChain</LogicalID>
At first I used the following regex:
.*([A-Za-z]+)>([A-Za-z]+)<.*
This matched as follows:
["D", "SupplyChain"]
In a fit of desperation, I tried using the asterisk instead of the plus:
.*([A-Za-z]*)>([A-Za-z]+)<.*
This matched perfectly.
The documentation says *
matches zero or more times and +
matches one or more times. Why is *
greedier than +
?
EDIT: It's been pointed out to me that this isn't the case below. The order of operations explains why the first match group is actually null.
Upvotes: 5
Views: 227
Reputation: 6527
Why is * greedier than +?
It isn't a matter of greediness.
The first regex .*([A-Za-z]+)>([A-Za-z]+)<.*
requires Group 1 to match one or more letters.
The second regex .*([A-Za-z]*)>([A-Za-z]+)<.*
allows Group 1 to match zero or more letters, so it can also match an empty string.
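To see the two quantifiers in isolation, here is a minimal sketch. It assumes Python's re module, since the question does not name a language, and the variable names are made up for illustration.

```python
import re

letters_plus = re.compile(r"[A-Za-z]+")  # one or more letters
letters_star = re.compile(r"[A-Za-z]*")  # zero or more letters

# An empty string satisfies * but not +.
plus_on_empty = letters_plus.fullmatch("")
star_on_empty = letters_star.fullmatch("")

print(plus_on_empty)              # None: + needs at least one letter
print(star_on_empty is not None)  # True: * accepts zero letters
```

This is why a group written with * is allowed to capture nothing, while the same group written with + must capture at least one letter.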
Upvotes: 1
Reputation: 4383
It's not a difference in greediness. In your first regex:
.*([A-Za-z]+)>([A-Za-z]+)<.*
You are asking for any number of characters (.*
), then at least one letter, then a >
. The greedy .* consumes as much as it can while still letting the rest of the pattern match, so it takes everything before D and leaves only D for the first group.
In the second one, instead:
.*([A-Za-z]*)>([A-Za-z]+)<.*
You want any number of characters, followed by any number of letters (including none), then the >
. So the first * consumes everything up to the >
, and the first capture group matches an empty string. I don't think that it "matches perfectly" at all.
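The behaviour described above can be checked with a short sketch. It assumes Python's re module (the question does not say which engine was used) and writes the closing tag out in full; the exact closing tag does not affect the result.

```python
import re

# Sample input modelled on the question; the closing tag is an assumption.
text = "<LogicalID>SupplyChain</LogicalID>"

# With +, the leading .* must leave at least one letter for group 1.
plus_groups = re.match(r".*([A-Za-z]+)>([A-Za-z]+)<.*", text).groups()

# With *, the leading .* can take every letter and leave group 1 empty.
star_groups = re.match(r".*([A-Za-z]*)>([A-Za-z]+)<.*", text).groups()

print(plus_groups)  # ('D', 'SupplyChain')
print(star_groups)  # ('', 'SupplyChain')
```

In both cases the leading .* is what shrinks the first group; the quantifier inside the group only decides whether it is allowed to end up empty.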
Upvotes: 5
Reputation: 785406
You should really be using this regex:
<([A-Za-z]+)>([A-Za-z]+)<
OR
<([A-Za-z]*)>([A-Za-z]+)<
Both will match LogicalID
and SupplyChain
respectively.
PS: Your regex .*([A-Za-z]*)>([A-Za-z]+)<
matches an empty string as its first group.
Working Demo: http://ideone.com/VMsb6n
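As a rough illustration of the anchored pattern, here is a sketch in Python. The language is an assumption (it is not necessarily what the linked demo uses), and tag_name and tag_value are hypothetical names.

```python
import re

# Sample input modelled on the question; the closing tag is an assumption.
text = "<LogicalID>SupplyChain</LogicalID>"

# Starting the pattern at the literal < removes the leading .* entirely,
# so nothing competes with group 1 for the letters of the tag name.
pattern = re.compile(r"<([A-Za-z]+)>([A-Za-z]+)<")

match = pattern.search(text)
tag_name, tag_value = match.groups()

print(tag_name, tag_value)  # LogicalID SupplyChain
```

Note that search is used rather than match with a trailing .*, because the pattern only needs to be found somewhere in the input.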
Upvotes: 2