Reputation: 3546
I'm trying to write a regex that matches a URL, e.g. 'http://www.test.com', and then wrap the match in anchor tags. That part already works with the following:
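string regex = @"(?:(?:https?|ftp|file):\/\/|www\.|ftp\.)(?:\([-A-Z0-9+&@#\/%=~_|$?!:;,.]*\)|[-A-Z0-9+&@#\/%=~_|$?!:;,.])*(?:\([-A-Z0-9+&@#\/%=~_|$?!:;,.]*\)|[A-Z0-9+&@#\/%=~_|$])";
Regex r = new Regex(regex, RegexOptions.IgnoreCase); // assumed: how r is built is not shown; IgnoreCase because the classes only list A-Z
msg = r.Replace(msg, "<a target=\"_blank\" href=\"$0\">$0</a>");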
But when the input text contains image tags, it incorrectly wraps the URL inside the image tag's src attribute in anchor tags, e.g.
<img src="<a>...</a>" />;
So far I have tried the following to work around that, but it is not working:
regex = @"(?!(src=""))(?:(?:https?|ftp|file):\/\/|www\.|ftp\.)(?:\([-A-Z0-9+&@#\/%=~_|$?!:;,.]*\)|[-A-Z0-9+&@#\/%=~_|$?!:;,.])*(?:\([-A-Z0-9+&@#\/%=~_|$?!:;,.]*\)|[A-Z0-9+&@#\/%=~_|$])"
EDIT:
(example testing input):
<p>
www.test1.com<br />
<br />
http://www.test2.com<br />
<br />
https://www.test3.com<br />
<br />
"https://www.test4.com<br />
<br />
'https://www.test4.com<br />
<br />
="https://www.test4.com</p>
<p>
</p>
<p>
<img alt="" src="..." style="width: 500px; height: 375px;" /></p>
(actual output):
<p>
<a target="_blank" href="www.test1.com">www.test1.com</a><br />
<br />
<a target="_blank" href="http://www.test2.com">http://www.test2.com</a><br />
<br />
<a target="_blank" href="https://www.test3.com">https://www.test3.com</a><br />
<br />
"<a target="_blank" href="https://www.test4.com">https://www.test4.com</a><br />
<br />
'<a target="_blank" href="https://www.test4.com">https://www.test4.com</a><br />
<br />
="<a target="_blank" href="https://www.test4.com">https://www.test4.com</a></p>
<p>
</p>
<p>
<img alt="" src="<a target="_blank" href="...">...</a>" style="width: 500px; height: 375px;" /></p>
(desired output):
<p>
<a target="_blank" href="www.test1.com">www.test1.com</a><br />
<br />
<a target="_blank" href="http://www.test2.com">http://www.test2.com</a><br />
<br />
<a target="_blank" href="https://www.test3.com">https://www.test3.com</a><br />
<br />
"<a target="_blank" href="https://www.test4.com">https://www.test4.com</a><br />
<br />
'<a target="_blank" href="https://www.test4.com">https://www.test4.com</a><br />
<br />
="<a target="_blank" href="https://www.test4.com">https://www.test4.com</a></p>
<p>
</p>
<p>
<img alt="" src="..." style="width: 500px; height: 375px;" /></p>
Upvotes: 0
Views: 286
Reputation: 3546
Here's the regex that solved the issue for me:
String regex = @"(?<!(""|'))((http|https|ftp|file):\/\/|www\.|ftp\.)(?:\([-A-Z0-9+&@#\/%=~_|$?!:;,.]*\)|[-A-Z0-9+&@#\/%=~_|$?!:;,.])*(?:\([-A-Z0-9+&@#\/%=~_|$?!:;,.]*\)|[A-Z0-9+&@#\/%=~_|$])";
I used a negative lookbehind assertion, (?<!(""|')), to make sure that the URL is not immediately preceded by an opening quote.
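For anyone comparing this with the attempt in the question: a lookahead such as (?!(src="")) inspects the text starting at the current position, so it cannot see what comes before the URL, whereas a lookbehind inspects the text that ends there. Below is a minimal sketch of how the pattern above behaves. The helper name Linkify, the sample inputs, and RegexOptions.IgnoreCase are my assumptions and are not part of the original post.

```csharp
using System;
using System.Text.RegularExpressions;

// Pattern copied from the answer above. RegexOptions.IgnoreCase is an assumption,
// needed because the character classes only list A-Z.
string pattern = @"(?<!(""|'))((http|https|ftp|file):\/\/|www\.|ftp\.)(?:\([-A-Z0-9+&@#\/%=~_|$?!:;,.]*\)|[-A-Z0-9+&@#\/%=~_|$?!:;,.])*(?:\([-A-Z0-9+&@#\/%=~_|$?!:;,.]*\)|[A-Z0-9+&@#\/%=~_|$])";
Regex urlRegex = new Regex(pattern, RegexOptions.IgnoreCase);

// Hypothetical helper: wraps every match in an anchor tag, as in the question.
string Linkify(string html) =>
    urlRegex.Replace(html, "<a target=\"_blank\" href=\"$0\">$0</a>");

// A bare URL in text is not preceded by a quote, so it is linked.
Console.WriteLine(Linkify("http://www.test2.com<br />"));

// The URL starts right after the opening quote, so the lookbehind rejects it.
Console.WriteLine(Linkify("<img alt=\"\" src=\"http://example.com/a.jpg\" />"));

// Caveat: the scheme is rejected here too, but "www." is preceded by '/',
// not by a quote, so the match restarts at "www." inside the attribute.
Console.WriteLine(Linkify("<img src=\"http://www.test.com/a.jpg\" />"));
```

Note the third case: the lookbehind only guards the character directly before the match, so a quoted URL that contains www. after its scheme can still be partially matched. The same applies to the quoted test4 lines in the question, where the match would begin at www. rather than at https://.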
Upvotes: 0
Reputation: 6159
Processing HTML with regex is the wrong approach, in my opinion.
Putting that aside, just add this rule after your regex match succeeds:
// If the match contains more than two '/' characters, treat it as an invalid result (needs using System.Linq).
bool isInvalid = regexResult.Count(c => c == '/') > 2;
If it solves your problem, you can fold this rule into your regex pattern.
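One way to apply this rule during replacement is a MatchEvaluator that skips rejected matches. This is only a sketch: the helper name LinkifyFiltered, the sample inputs, and RegexOptions.IgnoreCase are my assumptions, and the pattern is the original one from the question.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Pattern copied from the question. RegexOptions.IgnoreCase is an assumption.
string pattern = @"(?:(?:https?|ftp|file):\/\/|www\.|ftp\.)(?:\([-A-Z0-9+&@#\/%=~_|$?!:;,.]*\)|[-A-Z0-9+&@#\/%=~_|$?!:;,.])*(?:\([-A-Z0-9+&@#\/%=~_|$?!:;,.]*\)|[A-Z0-9+&@#\/%=~_|$])";
Regex urlRegex = new Regex(pattern, RegexOptions.IgnoreCase);

// Hypothetical helper: link a match only if it has at most two '/' characters.
string LinkifyFiltered(string html) =>
    urlRegex.Replace(html, m =>
        m.Value.Count(c => c == '/') > 2
            ? m.Value   // rejected: leave the matched text untouched
            : $"<a target=\"_blank\" href=\"{m.Value}\">{m.Value}</a>");

// Two slashes (from "http://"), so the URL is linked.
Console.WriteLine(LinkifyFiltered("http://www.test2.com<br />"));

// Four slashes (scheme plus path), so the src value is left alone.
Console.WriteLine(LinkifyFiltered("<img alt=\"\" src=\"http://example.com/pics/a.jpg\" />"));
```

Keep in mind that this rule is a heuristic: it also leaves ordinary text URLs with a path (for example http://www.test.com/a/b) unlinked, and it would still link a src value that has two or fewer slashes.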
Upvotes: 1