For some reason the following regex is not behaving as I would expect.
I am trying to extract all links from an HTML creative, but I can't find a way to handle links that contain spaces.
I know that links should be encoded, but I cannot encode the links if I cannot find them first.
I am testing against this HTML; notice that the only difference between the two lines is the space in {your reference}.
Find out <a href="http://website.co.uk?element=1&reference={your reference}"><span style="color:#000000;">what something is here</span></a><span style="color:#000000;">.</span><br />
Find out <a href="http://website.co.uk?element=1&reference={yourreference}"><span style="color:#000000;">what something is here</span></a><span style="color:#000000;">.</span><br />
With the following regex I only get the link without any spaces, as expected:
href="http(s{0,1}):\/\/(\S+)"
Finds:
href="http://website.co.uk?element=1&reference={yourreference}"
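For anyone who wants to reproduce this, here is a minimal sketch in Python. Python is an assumption on my part, since the regex flavour I am using is not the point of the question, and the names `sample_html` and `non_space_pattern` are only illustrative. It shows that `\S+` can never cross the space inside {your reference}, so only the second link is found.

```python
import re

# The two test lines from the question, joined with newlines.
sample_html = (
    'Find out <a href="http://website.co.uk?element=1&reference={your reference}">'
    '<span style="color:#000000;">what something is here</span></a>'
    '<span style="color:#000000;">.</span><br />\n'
    'Find out <a href="http://website.co.uk?element=1&reference={yourreference}">'
    '<span style="color:#000000;">what something is here</span></a>'
    '<span style="color:#000000;">.</span><br />\n'
)

# Same pattern as in the question: \S+ only matches non-whitespace characters.
non_space_pattern = re.compile(r'href="http(s{0,1}):\/\/(\S+)"')

# Collect the full text of every match.
non_space_matches = [m.group(0) for m in non_space_pattern.finditer(sample_html)]

for found in non_space_matches:
    print(found)
```

The first link is skipped because the run of non-space characters after `http://` ends at the space, and no closing `"` occurs before it.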
However, if I change the \S to a ., I expect the match to stop at the closing ", but it continues almost to the end of the line:
href="http(s{0,1}):\/\/(.+)"
Finds:
href="http://website.co.uk?element=1&reference={your reference}"><span style="color:#000000;">what something is here</span></a><span style="color:#000000;"
href="http://website.co.uk?element=1&reference={yourreference}"><span style="color:#000000;">what something is here</span></a><span style="color:#000000;"
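The output above is consistent with `.+` being greedy: it consumes the rest of the line and then backtracks only as far as the last `"` it can find. The sketch below, again in Python as an assumption, reproduces that and compares it with a negated character class `[^"]+`, which is one common way to stop at the first closing quote. The names `test_html`, `greedy_pattern` and `bounded_pattern` are hypothetical and not part of my real code.

```python
import re

# The two test lines from the question, joined with newlines.
test_html = (
    'Find out <a href="http://website.co.uk?element=1&reference={your reference}">'
    '<span style="color:#000000;">what something is here</span></a>'
    '<span style="color:#000000;">.</span><br />\n'
    'Find out <a href="http://website.co.uk?element=1&reference={yourreference}">'
    '<span style="color:#000000;">what something is here</span></a>'
    '<span style="color:#000000;">.</span><br />\n'
)

# Greedy version from the question: .+ runs to the end of the line,
# then backtracks to the LAST double quote on that line.
greedy_pattern = re.compile(r'href="http(s{0,1}):\/\/(.+)"')
greedy_matches = [m.group(0) for m in greedy_pattern.finditer(test_html)]

# Possible alternative (an assumption, not a confirmed fix for every input):
# [^"]+ matches anything except a double quote, so spaces are allowed
# but the match cannot run past the closing quote of the attribute.
bounded_pattern = re.compile(r'href="http(s{0,1}):\/\/([^"]+)"')
bounded_urls = [m.group(2) for m in bounded_pattern.finditer(test_html)]

for url in bounded_urls:
    print(url)
```

A lazy quantifier (`.+?`) would also stop at the first closing quote here, but the negated class states the intent more directly and does not depend on how the dot treats newlines.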
I also have a number of other checks to pick up different kinds of links; the final pattern looks like this:
(href="|href=\')%%siteurl%%(\S*)("|\')
|href="www\.(\S+)"
|href="http(s{0,1}):\/\/(\S+)"
|href=\'www\.(\S+)\'
|href=\'http(s{0,1}):\/\/(\S+)\'
I am not looking for help with this whole set, just the original regex posted; I will adjust the rest accordingly.