Reputation: 17049
I need to add a ".php" extension to every URL in href="xxx" that doesn't already end with "php". I am using a negative lookahead, (?!php):
find = r'href="(.+?)(?!php)"'
replace = r'href="\1.php"'
re.sub(find, replace, 'href="url"')
re.sub(find, replace, 'href="url.php"')
Both calls add the extension:
href="url.php"
href="url.php.php"
Why doesn't the negative lookahead work?
Upvotes: 2
Views: 195
Reputation: 500167
The following does work:
In [49]: re.sub(r'href="([^"]*?)([.]php)?"', r'href="\1.php"', 'href="url.php"')
Out[49]: 'href="url.php"'
In [50]: re.sub(r'href="([^"]*?)([.]php)?"', r'href="\1.php"', 'href="url"')
Out[50]: 'href="url.php"'
The reason your original regex, (.+?)(?!php), doesn't work is that it matches url.php as follows: (.+?) matches url.php, and (?!php) then succeeds because the text that follows is the closing ", not php. In other words, .+? consumes the entire filename including the extension, making the lookahead a no-op.
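To see this directly, here is a small sketch that inspects what the question's pattern captures. The variable names (text, m, rest) are only for illustration.

```python
import re

# The pattern from the question, unchanged.
find = r'href="(.+?)(?!php)"'
text = 'href="url.php"'

m = re.search(find, text)

# Group 1 has swallowed the extension as well.
print(m.group(1))  # url.php

# The lookahead is evaluated where group 1 ends; only the closing
# quote is left there, so "not followed by php" is trivially true.
rest = text[m.end(1):]
print(repr(rest))  # '"'
```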
Upvotes: 4
Reputation: 155
A negative lookahead makes the regex try to match a pattern at the current position without consuming any characters, and it succeeds only if that pattern does not match. Your pattern "(.+?)(?!php)" matches one or more characters until it reaches ", then tests the lookahead pattern, php, at that position. The php pattern can never match there, because the next character is ", and since this is a NEGATIVE lookahead, the whole pattern succeeds.
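For comparison, a lookahead can do the job if it is placed before the URL is consumed rather than after it. The following is only a sketch of that idea, not the pattern from the question; it assumes the attribute value contains no double quote.

```python
import re

# Hypothetical variant: the lookahead sits right after the opening quote,
# so it scans the whole attribute value before anything is consumed.
# It rejects the match if the value ends with php just before the closing quote.
find = r'href="(?![^"]*php")([^"]+)"'
replace = r'href="\1.php"'

without_ext = re.sub(find, replace, 'href="url"')
with_ext = re.sub(find, replace, 'href="url.php"')

print(without_ext)  # href="url.php"
print(with_ext)     # href="url.php"
```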
What you need is a negative lookbehind, (?<!PATTERN), which tests the pattern against the characters that have already been consumed. When the regex reaches ", the lookbehind compares the preceding 3 characters with php and rejects the match if they are the same.
In short, try the pattern below:
find = 'href="(.+?)(?<!php)"'
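A quick check of this pattern against the two inputs from the question, plus one caveat. Because . also matches ", the lazy group can run past a closing quote when the text holds more than one attribute. The second pattern below, find_strict, is my own assumption (URLs never contain a double quote) and not part of the original suggestion. The html sample is made up for illustration.

```python
import re

find = r'href="(.+?)(?<!php)"'            # pattern suggested above
find_strict = r'href="([^"]+)(?<!php)"'   # assumed variant: stay inside the quotes
replace = r'href="\1.php"'

# The two single-attribute inputs from the question behave as wanted.
print(re.sub(find, replace, 'href="url"'))      # href="url.php"
print(re.sub(find, replace, 'href="url.php"'))  # href="url.php"

# With more text on the line, (.+?) may skip over the first closing quote.
html = '<a href="a.php" class="x">A</a> <a href="b">B</a>'
loose = re.sub(find, replace, html)
strict = re.sub(find_strict, replace, html)

print(strict)  # <a href="a.php" class="x">A</a> <a href="b.php">B</a>
```

Restricting the group to [^"] keeps each match within a single attribute value, which is why the strict variant leaves class="x" untouched.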
Upvotes: 1