Reputation: 990
I want to find erroneous NCRs (numeric character references) that are missing the leading &# and repair them. The code point is a 4- or 5-digit decimal number. I wrote this PHP code:
function repl0($m) {
return '&#'.$m[0];
}
$s = "This is a good 23200; sample ship";
echo "input1= ".htmlentities($s)."<br>";
$out1=preg_replace_callback('/(?<!#)(\d{4,5};)/','repl0',$s);
echo 'output1 = '.htmlentities($out1).'<br>';
The output is:
input1= This is a good 23200; sample ship
output1 = This is a good 2&#3200; sample ship
The match only happens once, according to the output. What I want is to match '23200;' instead of '3200;'. Matching should be greedy by default, so I thought it would capture the 5-digit number instead of the 4-digit one. Do I misunderstand 'greedy' here? How can I get what I want?
Upvotes: -1
Views: 32
Reputation: 627400
The (?<!#)(\d{4,5};) pattern matches like this:
(?<!#) - matches a location that is not immediately preceded by #
(\d{4,5};) - then tries to match and consume four or five digits and a ; char immediately after these digits.
So, if you have #32000; as string input, 3 cannot be the starting character of a match, as it is preceded by #, but 2 can, since it is not preceded by # and the four digits 2000 followed by ; are enough for the pattern to match.
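To see where the original pattern actually starts matching, here is a minimal sketch. The two input strings are assumptions chosen for illustration (the first is the example from this answer with the & restored, the second uses the number from the question); `$found` is just a helper array for the demo.

```php
<?php
// Original pattern from the question: only '#' is excluded before the match.
$pattern = '/(?<!#)(\d{4,5};)/';

// Hypothetical inputs: NCRs that already carry their "&#" prefix.
$inputs = ['&#32000;', '&#23200;'];

$found = [];
foreach ($inputs as $input) {
    // PREG_OFFSET_CAPTURE also reports the byte offset where the match starts.
    preg_match($pattern, $input, $m, PREG_OFFSET_CAPTURE);
    $found[$input] = $m[0]; // [matched text, offset]
    printf("%s -> '%s' at offset %d\n", $input, $m[0][0], $m[0][1]);
}
```

In both cases the first digit is rejected because it follows #, so the engine moves one character to the right and finds a four-digit match there. Greediness only decides how many digits are taken once a start position has been accepted; it does not pull the start position back to the left.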
What you need here is to curb the match on the left by adding a digit to the lookbehind:
(?<![#\d])(\d{4,5};)
With this trick, you ensure that the match cannot be immediately preceded by either # or a digit.
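As a rough sketch of how the corrected pattern behaves with preg_replace_callback, assuming one input with a bare number and one with an NCR that is already valid (the function name add_ncr_prefix and both sample strings are made up for this example):

```php
<?php
// Callback that restores the missing "&#" prefix (same idea as repl0 in the question).
function add_ncr_prefix(array $m): string
{
    return '&#' . $m[0];
}

// Lookbehind now rejects positions preceded by '#' or by a digit.
$fixed = '/(?<![#\d])(\d{4,5};)/';

$bare  = 'This is a good 23200; sample ship';   // prefix missing
$valid = 'This is a good &#23200; sample ship'; // already a valid NCR

$outBare  = preg_replace_callback($fixed, 'add_ncr_prefix', $bare);
$outValid = preg_replace_callback($fixed, 'add_ncr_prefix', $valid);

echo $outBare, "\n";  // the bare number gets its prefix
echo $outValid, "\n"; // the valid NCR is left untouched
```

Because a match can no longer begin in the middle of a digit run, the only candidate start is the first digit, and that one is accepted only when no # precedes it.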
You say you finally used (?<!#)(?<!\d)\d{4,5}; and this pattern is functionally equivalent to the pattern above, since lookbehinds, like all lookarounds, "stand their ground", i.e. the regex index does not move when the lookaround patterns are matched. So, the checks for a digit and for a # char occur at the same location in the string.
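A quick way to convince yourself of the equivalence is to run both patterns over a handful of strings and compare the matches. This is only a spot check on assumed sample inputs, not a proof:

```php
<?php
// Single lookbehind with a character class vs. two consecutive lookbehinds.
$combined = '/(?<![#\d])\d{4,5};/';
$split    = '/(?<!#)(?<!\d)\d{4,5};/';

// Hypothetical samples covering bare numbers, valid NCRs and over-long digit runs.
$samples = ['23200;', '&#23200;', 'x 1234; y', '#32000;', '123456;', 'a9999;b12345;'];

$allEqual = true;
foreach ($samples as $s) {
    preg_match_all($combined, $s, $a);
    preg_match_all($split, $s, $b);
    // $a[0] and $b[0] hold the full matches found by each pattern.
    if ($a[0] !== $b[0]) {
        $allEqual = false;
    }
    printf("%-16s %s\n", $s, json_encode($a[0]));
}

var_dump($allEqual);
```

Both lookbehinds are zero-width assertions tested at the same position, so requiring "not #" and "not a digit" one after the other is the same as requiring "not one of # or a digit" in a single class.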
Upvotes: 0