Reputation: 2035
I am building a script to scan through HTML files and replace all 'src' and 'href' attributes under certain conditions. Here is the regex I have right now:
(href|src)=["|'](.*?)["|']
What I am not sure about is how to expand the (.*?) so that it skips values containing mailto: or https://, or values that do not contain http://www.google.co.uk, for example.
The basic idea of this script is to replace all assets not covered by SSL and put them under an SSL secured URL.
Does anyone know how this can be achieved?
Many thanks.
Upvotes: 1
Views: 1995
Reputation: 41838
Here is your expression with a number of tweaks for improved syntax:
(?:href|src)=(["'])(?!mailto|https).*?\1
The tweaks, one by one:
There is no need to capture href or src into their own capture group, so a non-capturing group will do: (?: ... )
Removed the | from the character class for the opening quotes, as it does not mean OR there. Inside a character class it matches a literal | character.
Captured the opening quote with (["']), which enables us to ensure that the closing quote is the same type by using the back-reference \1. Otherwise your expression would match src="http://google.com' (double quote and single quote = unbalanced).
The .*? presumably does not need to be in a capture group.
\1 refers to capture Group 1, that is to say the content of the first capturing parentheses, i.e., either a single or a double quote, ensuring that we match the same kind of quote at the beginning and at the end.
Upvotes: 3
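To see the effect of the back-reference and the negative lookahead from the answer above, here is a minimal Python sketch. The helper name is_match and the address someone@example.com are illustrative assumptions, not part of the original posts.

```python
import re

# Tweaked pattern from the answer: the opening quote is captured in group 1
# and \1 requires the same quote character to close the value.
TWEAKED = re.compile(r"""(?:href|src)=(["'])(?!mailto|https).*?\1""")

# The asker's original pattern, kept only for comparison.
ORIGINAL = re.compile(r"""(href|src)=["|'](.*?)["|']""")


def is_match(pattern: re.Pattern, text: str) -> bool:
    """Return True if the pattern matches anywhere in text (hypothetical helper)."""
    return pattern.search(text) is not None


if __name__ == "__main__":
    samples = [
        'src="http://google.com"',            # balanced double quotes
        "src=\"http://google.com'",           # unbalanced quotes
        'src="https://google.co.uk"',         # already https
        "href='mailto:someone@example.com'",  # mailto link (made-up address)
    ]
    for sample in samples:
        print(sample, is_match(ORIGINAL, sample), is_match(TWEAKED, sample))
```

The unbalanced sample is the interesting one: the original pattern accepts it because either quote character may close the value, while the tweaked pattern rejects it because \1 must repeat the opening quote.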
Reputation: 2035
OK, after a little more research I found the answer to this. My regex has been expanded to the following:
(href|src)=["|']((?!mailto|https).*?)["|']
Examples below:
src="http://google.co.uk" > match
src='http://google.co.uk' > match
src="/css/test.css" > match
src='/css/test.css' > match
src="css/test.css" > match
src='css/test.css' > match
src="https://google.co.uk" > no match
src='https://google.co.uk' > no match
src="mailto:[email protected]" > no match
src='mailto:[email protected]' > no match
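For the replacement step itself, the expanded regex can be combined with a substitution callback. This is only a sketch under stated assumptions: the base URL https://secure.example.com/ and the rewrite policy in secure_url (upgrade http:// to https://, resolve relative paths against the secure base) are hypothetical and should be swapped for whatever the real script needs.

```python
import re
from urllib.parse import urljoin

# The expanded pattern from this post, copied unchanged.
PATTERN = re.compile(r"""(href|src)=["|']((?!mailto|https).*?)["|']""")

# Hypothetical SSL-secured base URL; not from the original post.
SECURE_BASE = "https://secure.example.com/"


def secure_url(url: str) -> str:
    """Assumed policy: upgrade http:// URLs, resolve relative paths against SECURE_BASE."""
    if url.startswith("http://"):
        return "https://" + url[len("http://"):]
    return urljoin(SECURE_BASE, url)


def rewrite(html: str) -> str:
    """Rewrite every matched src/href value; the result always uses double quotes."""
    return PATTERN.sub(
        lambda m: f'{m.group(1)}="{secure_url(m.group(2))}"',
        html,
    )


if __name__ == "__main__":
    print(rewrite('<img src="http://google.co.uk">'))
    print(rewrite("<link href='/css/test.css'>"))
    print(rewrite('<a href="https://google.co.uk">'))
```

Attributes that already point at https:// or mailto: are left untouched because the negative lookahead stops the pattern from matching them. Note that the callback normalises the surrounding quotes to double quotes, which is a design choice of this sketch rather than something the original regex requires.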
Upvotes: 0