John

Reputation: 3546

C# Regex for URLs

I'm trying to write a regex that matches a URL, e.g. 'http://www.test.com', and then wrap the match in anchor tags. That part already works with the following:

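// RegexOptions.IgnoreCase is assumed here, since the character classes only list A-Z
string regex = @"(?:(?:https?|ftp|file):\/\/|www\.|ftp\.)(?:\([-A-Z0-9+&@#\/%=~_|$?!:;,.]*\)|[-A-Z0-9+&@#\/%=~_|$?!:;,.])*(?:\([-A-Z0-9+&@#\/%=~_|$?!:;,.]*\)|[A-Z0-9+&@#\/%=~_|$])";
Regex r = new Regex( regex, RegexOptions.IgnoreCase );
msg = r.Replace( msg, "<a target=\"_blank\" href=\"$0\">$0</a>" );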

But when the input text contains image tags, it incorrectly puts anchor tags inside the image tag's src attribute, e.g.:

<img src="<a>...</a>" />;

So far I've tried the following to work around that (not working):

regex = @"(?!(src=""))(?:(?:https?|ftp|file):\/\/|www\.|ftp\.)(?:\([-A-Z0-9+&@#\/%=~_|$?!:;,.]*\)|[-A-Z0-9+&@#\/%=~_|$?!:;,.])*(?:\([-A-Z0-9+&@#\/%=~_|$?!:;,.]*\)|[A-Z0-9+&@#\/%=~_|$])"

EDIT:

(example testing input):

<p>
    www.test1.com<br />
    <br />
    http://www.test2.com<br />
    <br />
    https://www.test3.com<br />
    <br />
    &quot;https://www.test4.com<br />
    <br />
    &#39;https://www.test4.com<br />
    <br />
    =&quot;https://www.test4.com</p>
<p>
    &nbsp;</p>
<p>
    <img alt="" src="..." style="width: 500px; height: 375px;" /></p>

(example output):

<p>
    <a target="_blank" href="www.test1.com">www.test1.com</a><br />
    <br />
    <a target="_blank" href="http://www.test2.com">http://www.test2.com</a><br />
    <br />
    <a target="_blank" href="https://www.test3.com">https://www.test3.com</a><br />
    <br />
    &quot;<a target="_blank" href="https://www.test4.com">https://www.test4.com</a><br />
    <br />
    &#39;<a target="_blank" href="https://www.test4.com">https://www.test4.com</a><br />
    <br />
    =&quot;<a target="_blank" href="https://www.test4.com">https://www.test4.com</a></p>
<p>
    &nbsp;</p>
<p>
    <img alt="" src="<a target="_blank" href="...">...</a>" style="width: 500px; height: 375px;" /></p>

(desired output):

<p>
    <a target="_blank" href="www.test1.com">www.test1.com</a><br />
    <br />
    <a target="_blank" href="http://www.test2.com">http://www.test2.com</a><br />
    <br />
    <a target="_blank" href="https://www.test3.com">https://www.test3.com</a><br />
    <br />
    &quot;<a target="_blank" href="https://www.test4.com">https://www.test4.com</a><br />
    <br />
    &#39;<a target="_blank" href="https://www.test4.com">https://www.test4.com</a><br />
    <br />
    =&quot;<a target="_blank" href="https://www.test4.com">https://www.test4.com</a></p>
<p>
    &nbsp;</p>
<p>
    <img alt="" src="..." style="width: 500px; height: 375px;" /></p>

Upvotes: 0

Views: 286

Answers (2)

John

Reputation: 3546

Here's the regex that solved the issue for me:

String regex = @"(?<!(""|'))((http|https|ftp|file):\/\/|www\.|ftp\.)(?:\([-A-Z0-9+&@#\/%=~_|$?!:;,.]*\)|[-A-Z0-9+&@#\/%=~_|$?!:;,.])*(?:\([-A-Z0-9+&@#\/%=~_|$?!:;,.]*\)|[A-Z0-9+&@#\/%=~_|$])";

I used a negative lookbehind assertion to make sure that the URL is not immediately preceded by an opening quote.
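As a rough illustration of how that lookbehind behaves, here is a minimal sketch that wraps the pattern above in a helper. The class name UrlLinker, the method name Linkify, and the sample input are hypothetical, and RegexOptions.IgnoreCase is an assumption (the character classes only list A-Z, so the original code presumably matched case-insensitively).

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original answer.
public static class UrlLinker
{
    // Pattern copied from the answer above. IgnoreCase is an assumption.
    private static readonly Regex UrlRegex = new Regex(
        @"(?<!(""|'))((http|https|ftp|file):\/\/|www\.|ftp\.)(?:\([-A-Z0-9+&@#\/%=~_|$?!:;,.]*\)|[-A-Z0-9+&@#\/%=~_|$?!:;,.])*(?:\([-A-Z0-9+&@#\/%=~_|$?!:;,.]*\)|[A-Z0-9+&@#\/%=~_|$])",
        RegexOptions.IgnoreCase);

    // Wraps every match in an anchor tag; $0 is the whole matched URL.
    public static string Linkify(string html)
    {
        return UrlRegex.Replace(html, "<a target=\"_blank\" href=\"$0\">$0</a>");
    }
}
```

With the made-up input <p>http://test.com<br /><img src="http://test.com/pic.png" /></p>, calling UrlLinker.Linkify wraps the first URL and leaves the src value alone, because the match attempt at the quoted "http" is rejected by the lookbehind. One thing to watch for: a quoted URL such as "http://www.test.com/pic.png" can still match starting at "www.", since that position is preceded by a slash rather than a quote.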

Upvotes: 0

G.Y

Reputation: 6159

Processing HTML with regex is the wrong approach, in my opinion.

Putting that aside, just add this rule after your regex match succeeds:

// Requires "using System.Linq;". regexResult is the matched string.
if (regexResult.Count(c => c == '/') > 2) { /* more than two '/' characters: treat it as an invalid result */ }

You can add this rule to your regex pattern if it solves your problem.
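One way this rule might be applied during replacement is with a MatchEvaluator, sketched below. The class name SlashFilterLinker and the method name Linkify are hypothetical, the pattern is the one from the question, and RegexOptions.IgnoreCase is again an assumption.

```csharp
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper illustrating the slash-count rule.
public static class SlashFilterLinker
{
    // Original pattern from the question (no lookbehind). IgnoreCase is an assumption.
    private static readonly Regex UrlRegex = new Regex(
        @"(?:(?:https?|ftp|file):\/\/|www\.|ftp\.)(?:\([-A-Z0-9+&@#\/%=~_|$?!:;,.]*\)|[-A-Z0-9+&@#\/%=~_|$?!:;,.])*(?:\([-A-Z0-9+&@#\/%=~_|$?!:;,.]*\)|[A-Z0-9+&@#\/%=~_|$])",
        RegexOptions.IgnoreCase);

    public static string Linkify(string html)
    {
        // Matches with more than two '/' characters are returned unchanged;
        // all other matches are wrapped in an anchor tag.
        return UrlRegex.Replace(html, m =>
            m.Value.Count(c => c == '/') > 2
                ? m.Value
                : "<a target=\"_blank\" href=\"" + m.Value + "\">" + m.Value + "</a>");
    }
}
```

This only helps when the image URL has a path (and so more than two slashes), and it also skips ordinary text URLs that have a path, so it is a heuristic rather than a real fix.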

Upvotes: 1
