Reputation: 4203
I'm building code that matches and replaces several types of patterns (BBCode). One of the matches I'm trying to make is [url=http:example.com], replacing each occurrence with an anchor link. I'm also trying to match and replace plain-text URLs with anchor links. The combination of these two is where I'm running into trouble.
Since my routine is recursive, matching and replacing the entire text on each run, I'm having trouble NOT replacing URLs that are already contained in anchors.
This is the recursive routine I'm running:
if (text.search(p.pattern) !== -1) {
    text = text.replace(p.pattern, p.replace);
}
This is my regexp for plain urls so far:
/(?!href="|>)(ht|f)tps?:\/\/.*?(?=\s|$)/ig
URLs can start with http, https, ftp or ftps and contain any text afterwards, ending at whitespace or a punctuation mark (. / ! / ? / ,).
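As a quick check of why the leading lookahead has no effect, here is a minimal sketch; the sample string is made up for illustration. A lookahead inspects the text that follows the current position, so at the h of http it can never see href=" or >. Whatever precedes the URL would have to be tested another way, for example with a lookbehind (in engines that support it) or inside a replacement callback.

```javascript
// The plain-URL pattern from the question, unchanged.
var plainUrl = /(?!href="|>)(ht|f)tps?:\/\/.*?(?=\s|$)/ig;

// Hypothetical sample: a URL that already sits inside an anchor.
var anchorSample = '<a href="http://www.example.com">link</a>';

// The lookahead runs at the "h" of "http", where the upcoming text is
// never 'href="' or '>', so it does not filter anything out.
var found = anchorSample.match(plainUrl);
console.log(found !== null ? 'URL inside href was matched' : 'no match');
```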
Just to be absolutely clear, I'm using this as a test for matches:
Should match:
Should not match
I would really appreciate any help I can get here.
EDIT: The solution by jkshah below, which I accepted first, does have some flaws. For instance, it will match
<img src="http://www.example.com/test.jpg">
The comments on Jerry's solution, however, made me want to try it again, and that solution solved this issue as well. I therefore accepted that solution instead. Thank you all for your kind help on this. :)
Upvotes: 4
Views: 2448
Reputation: 71538
Maybe something like this?
/(?:(?:ht|f)tps?:\/\/|www)[^<>\]]+?(?![^<>\]]*([>]|<\/))(?=[\s!,?\]]|$)/gm
Then trim the dots at the end, if any.
If the link ends with other punctuation, though, that might cause some issues. In that case I would suggest capturing the link first, then removing the trailing punctuation with a second replace.
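The "capture first, trim afterwards" idea could look roughly like this. It is only a sketch: the function name wrapLink and the set of punctuation characters (. , ! ?) are assumptions for illustration, not part of the answer.

```javascript
// Hypothetical helper: takes one captured link, moves any trailing
// punctuation outside the anchor, and returns the replacement markup.
function wrapLink(link) {
  // Trailing run of . , ! ? (empty string if there is none).
  var trailing = (link.match(/[.,!?]+$/) || [''])[0];
  var url = link.slice(0, link.length - trailing.length);
  return '<a href="' + url + '">' + url + '</a>' + trailing;
}

console.log(wrapLink('http://example.com/test.'));
```

Punctuation in the middle of the URL, such as the ? of a query string, is left alone because only the run at the very end is stripped.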
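[^<>\]]+ will match every character except <, > and ].
(?![^<>\]]*([>]|<\/)) prevents the matching of a link between HTML tags: the match is rejected when the next bracket after it is > (the link is inside a tag) or </ (the link is followed by a closing tag).
(?=[\s!,?\]]|$) is for the punctuation and whitespace.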
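A small check of this pattern against three lines, as a sketch; the sample lines are assumptions modelled on the cases discussed in this thread. Only the plain URL on the first line should be reported.

```javascript
// The pattern from this answer, unchanged.
var linkPattern = /(?:(?:ht|f)tps?:\/\/|www)[^<>\]]+?(?![^<>\]]*([>]|<\/))(?=[\s!,?\]]|$)/gm;

// Hypothetical sample: one plain URL, one anchor, one image tag.
var jerrySample = [
  'http://www.example.com ',
  '<a href="http://www.example.com">http://www.example.com </a>',
  '<img src="http://www.example.com/test.jpg">'
].join('\n');

// With the g flag, match() returns every full match as an array.
var jerryMatches = jerrySample.match(linkPattern);
console.log(jerryMatches);
```

The URLs in href and src are skipped because the next bracket after them is >, and the anchor text is skipped because it is followed by </.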
Upvotes: 3
Reputation: 11703
The following regex should work. It gives the desired result on your sample inputs.
/((?:(?:ht|f)tps?:\/\/|www)[^\s,?!]+(?!.*<\/a>))/gm
See it in action here
(?!.*<\/a>) is a negative lookahead for a closing anchor tag later on the same line.
The matched content is stored in $1 and can be used in the replacement string.
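Using $1 in the replacement string might look like this. It is a sketch: the function name linkifyPlain and the sample strings are assumptions for illustration.

```javascript
// The pattern from this answer, unchanged.
var jkPattern = /((?:(?:ht|f)tps?:\/\/|www)[^\s,?!]+(?!.*<\/a>))/gm;

// Hypothetical helper: wraps every match in an anchor, reusing $1
// for both the href and the link text.
function linkifyPlain(text) {
  return text.replace(jkPattern, '<a href="$1">$1</a>');
}

console.log(linkifyPlain('Visit http://example.com/test today'));
console.log(linkifyPlain('<a href="http://example.com">http://example.com </a>'));
```

The second call returns its input unchanged, because every candidate match on that line is followed by </a>.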
EDIT
To avoid matching content with <img src .., the following can be used:
(^(?!.*<img\s+src)(?:(?:ht|f)tps?:\/\/|www)[^\s,?!]+(?!.*<\/a>))
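One thing worth checking with this variant is the effect of the leading ^. The sketch below assumes the same gm flags as the first pattern (the answer does not state them) and uses made-up sample strings. Because the URL alternation follows ^ directly, a URL is only found when it starts a line.

```javascript
// The edited pattern from this answer; the gm flags are an assumption.
var noImgPattern = /(^(?!.*<img\s+src)(?:(?:ht|f)tps?:\/\/|www)[^\s,?!]+(?!.*<\/a>))/gm;

// URL at the start of a line: matched.
var atLineStart = 'http://www.example.com/test'.match(noImgPattern);
// URL in the middle of a line: not matched, since ^ cannot match there.
var midLine = 'see http://www.example.com/test'.match(noImgPattern);
// URL inside an image tag: not matched.
var inImg = '<img src="http://www.example.com/test.jpg">'.match(noImgPattern);

console.log(atLineStart, midLine, inImg);
```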
Upvotes: 1
Reputation: 732
Can p.replace be a function? If so:
var text = 'http://www.example.com \n' +
'http://www.example.com/test \n' +
'http://example.com/test \n' +
'www.example.com/test \n' +
'<a href="http://www.example.com">http://www.example.com </a>\n' +
'<a href="http://www.example.com/test">http://www.example.com/test </a>\n' +
'<a href="http://example.com/test">http://example.com/test </a>\n' +
'<a href="www.example.com/test">www.example.com/test </a>';
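var p = {
    flag: true,
    pattern: /(<a[^<]*<\/a>)|((ht|f)tps?:\/\/|www\.).*?(?=\s|$)/ig,
    replace: function ($0, $1) {
        if ($1) {
            // $1 is set only when an existing anchor matched: leave it untouched
            return $0;
        } else {
            // a plain URL matched: request another pass and replace it
            p.flag = true;
            return "construct replacement string here";
        }
    }
};
// repeat until a pass makes no replacement
while (p.flag) {
    p.flag = false;
    text = text.replace(p.pattern, p.replace);
}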
The part of the regex I added is (<a[^<]*<\/a>)|, which checks whether the URL is anywhere inside an anchor; if so, the replacement function ignores it.
If you want to skip only the URL inside <a href="..."> but still replace other URLs inside the anchor, change (<a[^<]*<\/a>)| to (<a[^>]*>)|
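For completeness, here is one way the placeholder replacement string could be filled in. This is a sketch, not the answer's own code: the name linkify and the choice to prefix http:// to www. links are assumptions.

```javascript
// Hypothetical helper built on the pattern from this answer.
function linkify(text) {
  var pattern = /(<a[^<]*<\/a>)|((ht|f)tps?:\/\/|www\.).*?(?=\s|$)/ig;
  return text.replace(pattern, function (match, anchor) {
    // "anchor" is defined only when the first alternative matched,
    // i.e. the text is already an <a>...</a> element: keep it as is.
    if (anchor) {
      return match;
    }
    // Assumption: links starting with www. get an http:// scheme.
    var href = /^www\./i.test(match) ? 'http://' + match : match;
    return '<a href="' + href + '">' + match + '</a>';
  });
}

var linked = linkify('see www.example.com/test and <a href="http://example.com">http://example.com </a>');
console.log(linked);
```

Since anchors produced by the first pass are themselves caught by the first alternative, running linkify on its own output should leave the text unchanged, which suits a routine that is applied repeatedly.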
Upvotes: 0