Jim Wright

Reputation: 6058

Finding links with spaces with regex

For some reason the following regex is not behaving as I would expect.

I am trying to extract all links from an HTML creative, but I can't find a way to handle links that contain spaces.

I know that links should be encoded, but there is no way to encode the links if I cannot find them.

I am testing against this HTML. Notice that the only difference between the two lines is the space in {your reference}.

Find out <a href="http://website.co.uk?element=1&amp;reference={your reference}"><span style="color:#000000;">what something is here</span></a><span style="color:#000000;">.</span><br />

Find out <a href="http://website.co.uk?element=1&amp;reference={yourreference}"><span style="color:#000000;">what something is here</span></a><span style="color:#000000;">.</span><br />

With the following regex I only get the link without spaces, as expected:

href="http(s{0,1}):\/\/(\S+)"

Finds:

href="http://website.co.uk?element=1&amp;reference={yourreference}"

However, if I change \S to . I expect the match to end at the closing ", but it continues almost to the end of the string:

href="http(s{0,1}):\/\/(.+)"

Finds:

href="http://website.co.uk?element=1&amp;reference={your reference}"><span style="color:#000000;">what something is here</span></a><span style="color:#000000;"

href="http://website.co.uk?element=1&amp;reference={yourreference}"><span style="color:#000000;">what something is here</span></a><span style="color:#000000;"

I also have a number of other checks to pick up different kinds of links; the final pattern looks like this:

(href="|href=\')%%siteurl%%(\S*)("|\')
|href="www\.(\S+)"
|href="http(s{0,1}):\/\/(\S+)"
|href=\'www\.(\S+)\'
|href=\'http(s{0,1}):\/\/(\S+)\'

I am not looking for help with this set, just the original regex posted; I will adjust the rest accordingly.

Upvotes: 1

Views: 38

Answers (1)

vks

Reputation: 67978

href="http(s{0,1}):\/\/(.+?)"

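                         ^^

Make your quantifier non-greedy. A plain .+ is greedy, so it consumes as much as it can and only gives back enough to reach the last " on the line. With .+? the match stops at the first " that follows the URL.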
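As a rough illustration, here is a minimal Python sketch comparing the greedy and non-greedy versions against the first test line from the question. Python is only an assumption here, since the question does not say which regex engine is in use, and the variable names are made up for the example. The third pattern, using a negated character class, is an alternative that is not part of the original answer; it should give the same result for double-quoted attributes.

```python
import re

# First test line from the question (the one with a space in the URL).
html = (
    'Find out <a href="http://website.co.uk?element=1&amp;reference={your reference}">'
    '<span style="color:#000000;">what something is here</span></a>'
    '<span style="color:#000000;">.</span><br />'
)

# Greedy: .+ runs to the last double quote on the line.
greedy = re.compile(r'href="http(s{0,1}):\/\/(.+)"')

# Non-greedy: .+? stops at the first double quote after the URL.
lazy = re.compile(r'href="http(s{0,1}):\/\/(.+?)"')

# Alternative (not from the answer): match anything except a double quote.
negated = re.compile(r'href="https?://([^"]+)"')

greedy_match = greedy.search(html).group(0)
lazy_match = lazy.search(html).group(0)
negated_match = negated.search(html).group(0)

print(greedy_match)
print(lazy_match)
print(negated_match)
```

The negated class avoids backtracking through the rest of the line, which may matter on large inputs, but the non-greedy quantifier is the smaller change to the original pattern.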

Upvotes: 1
