Reputation: 18148
We are using the following regex to recognize URLs (derived from this gist by Jim Gruber). It is executed in Scala using scala.util.matching, which in turn uses java.util.regex:
(?i)\b((?:https?:(?:/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?!js)[a-z]{2,6}/)(?:[^\s()<>{}\[\]]+)(?:[^\s`!()\[\]{};:'".,<>?«»“”‘’])|(?:(?<!@)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?!js)[a-z]{2,6}\b/?(?!@)))
This version has escaped forward slashes, for Rubular:
(?i)\b(((?:https?:(?:\/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?!js)[a-z]{2,6}\/)(?:[^\s()<>{}\[\]]+)(?:[^\s`!()\[\]{};:'".,<>?«»“”‘’])|(?:(?<!@)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?!js)[a-z]{2,6}\b\/?(?!@))))
Previously the front end only sent plaintext to the back end, but now users are allowed to create anchor tags for URLs. The back end therefore needs to recognize URLs except for those that are already in anchor tags. I initially tried to accomplish this with a negative lookbehind, ignoring URLs with an href=" prefix:
(?i)\b((?<!href=")((?:https?: ... etc
The problem is that our URL regex is very liberal, recognizing http://www.google.com, www.google.com, and google.com. Given
<a href="http://www.google.com">Google</a>
the negative lookbehind will ignore http://www.google.com, but the regex will then still recognize www.google.com. I'm wondering if there's a succinct way to tell the regex "ignore www.google.com and google.com if they are substrings of an ignored http(s)://www.google.com".
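To make the problem concrete, here is a minimal sketch of the behaviour described above. It assumes a much simplified URL pattern (the hypothetical `simpleUrl` below, not the full regex from this question) guarded by the same negative lookbehind.

```scala
import scala.util.matching.Regex

object LookbehindDemo {
  // Hypothetical, simplified stand-in for the full URL regex above.
  val simpleUrl = """(?:https?://)?[a-z0-9]+(?:[.\-][a-z0-9]+)*[.][a-z]{2,6}"""

  // Same idea as the attempt above: reject a match directly preceded by href="
  val guarded: Regex = ("(?i)\\b(?<!href=\")(" + simpleUrl + ")").r

  // The lookbehind only rejects the match that starts at "http";
  // the engine then retries at later positions and can match at "www".
  def matches(text: String): List[String] =
    guarded.findAllMatchIn(text).map(_.group(1)).toList
}
```

With the input `<a href="http://www.google.com">Google</a>`, this sketch should return only `www.google.com`, which is exactly the unwanted fragment.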
At present I'm using a filter on the URL regex matches (code is in Scala). It also ignores URLs in link text (<a href="http://www.google.com">www.google.com</a>) by ignoring URLs with a > prefix and a </a> suffix. I'd rather stick with the filter if doing this in a regex would make an already complicated regex even more unreadable.
urlPattern.findAllMatchIn(text).toList.filter { m =>
  val start: Int = m.start(1)
  val end: Int = m.end(1)
  // The match is an href value if it is directly preceded by href="
  val isHref: Boolean = start >= 6 &&
    text.substring(start - 6, start) == """href=""""
  // The match is link text if it sits directly between > and </a>.
  // "</a>" is 4 characters long, so 4 characters must be compared.
  val isAnchor: Boolean = start >= 1 && end + 4 <= text.length &&
    text.substring(start - 1, start) == ">" &&
    text.substring(end, end + 4) == "</a>"
  !(isHref || isAnchor) && Option(m.group(1)).isDefined
}
Upvotes: 3
Views: 578
Reputation: 67988
<a href=\S+|\b((?:https?:(?:\/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?!js)[a-z]{2,6}\/)(?:[^\s()<>{}\[\]]+)(?:[^\s`!()\[\]{};:'".,<>?«»“”‘’])|(?:(?<!@)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?!js)[a-z]{2,6}\b\/?(?!@)))
or
<a href=(?:(?!<\/a>).)*<\/a>|\b((?:https?:(?:\/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?!js)[a-z]{2,6}\/)(?:[^\s()<>{}\[\]]+)(?:[^\s`!()\[\]{};:'".,<>?«»“”‘’])|(?:(?<!@)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?!js)[a-z]{2,6}\b\/?(?!@)))
Try this. What it essentially does is:
Consumes all href links so that they cannot be matched later.
Does not capture them, so they will not appear in the groups anyway.
Processes the rest as before.
See demo:
http://regex101.com/r/vR4fY4/17
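The three steps above can be sketched in Scala roughly as follows. This is only an illustration: `simpleUrl` is a hypothetical, much simplified stand-in for the full URL regex, and `SkipAnchors` and `bareUrls` are names made up for this example.

```scala
import scala.util.matching.Regex

object SkipAnchors {
  // Hypothetical, simplified stand-in for the full URL regex.
  val simpleUrl = """(?:https?://)?[a-z0-9]+(?:[.\-][a-z0-9]+)*[.][a-z]{2,6}(?:/[^\s<>"]*)?"""

  // Left alternative: consumes a whole anchor element without capturing it.
  // Right alternative: captures a bare URL in group 1.
  val pattern: Regex =
    ("""(?i)<a\s+href=(?:(?!</a>).)*</a>|\b(""" + simpleUrl + ")").r

  // Group 1 is null whenever the left alternative matched, so
  // wrapping it in Option drops the consumed anchors.
  def bareUrls(text: String): List[String] =
    pattern.findAllMatchIn(text).toList.flatMap(m => Option(m.group(1)))
}
```

Because the anchor alternative comes first and consumes the whole element, neither the href value nor the link text is offered to the URL alternative afterwards.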
Upvotes: 1
Reputation: 11051
It seems that you want to ignore not only www.google.com and google.com when they are substrings of an ignored http(s)://www.google.com, but any substring fragment of a previously ignored section. In that case, you can use a bit of code to work around this. Consider the regex:
(a href=")?(?i)\b(((?:https?:(?:\/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?!js)[a-z]{2,6}\/)(?:[^\s()<>{}\[\]]+)(?:[^\s`!()\[\]{};:'".,<>?«»“”‘’])|(?:(?<!@)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?!js)[a-z]{2,6}\b\/?(?!@))))
^^^^^^^^^^^
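I'm not good at Scala, but you can probably do this:
import scala.util.matching.Regex
// Name the first capturing group "unwanted"; pass the pattern as a String, not a Regex
val links = new Regex("""(a href=")?(?i)\b(((?:https?:... """, "unwanted")
val unwanted = for (o <- links findAllMatchIn text) yield o group "unwanted"
If the unwanted group of a match is null, then the match is useful.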
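A fuller sketch of this optional-group idea might look like the following. It assumes a simplified URL pattern (the hypothetical `simpleUrl`) in place of the full regex, and uses numbered groups rather than a named one to keep the example short.

```scala
import scala.util.matching.Regex

object OptionalPrefix {
  // Hypothetical, simplified stand-in for the full URL regex.
  val simpleUrl = """(?:https?://)?[a-z0-9]+(?:[.\-][a-z0-9]+)*[.][a-z]{2,6}"""

  // Group 1 is the optional, unwanted prefix; group 2 is the URL itself.
  val links: Regex = ("(?i)(a href=\")?\\b(" + simpleUrl + ")").r

  // Keep only the matches where the prefix group did not participate.
  def wanted(text: String): List[String] =
    links.findAllMatchIn(text).toList
      .filter(m => m.group(1) == null)
      .map(m => m.group(2))
}
```

Since the href value is consumed together with its prefix, its fragments are not matched again. URLs used as link text would still be returned by this sketch and would need a separate check.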
If you need to do a replacement, you can work around this by matching an alternation instead:
a href="(?i)\b(?:(?:https?:(?:\/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?!js)[a-z]{2,6}\/)(?:[^\s()<>{}\[\]]+)(?:[^\s`!()\[\]{};:'".,<>?«»“”‘’])|(?:(?<!@)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?!js)[a-z]{2,6}\b\/?(?!@)))|((?i)\b(((?:https?:(?:\/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?!js)[a-z]{2,6}\/)(?:[^\s()<>{}\[\]]+)(?:[^\s`!()\[\]{};:'".,<>?«»“”‘’])|(?:(?<!@)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?!js)[a-z]{2,6}\b\/?(?!@)))))
The second part of the regex, after the pipe |, is wrapped in a capturing group. You can use this regex in a replacement that refers to the first group: \1 (written $1 in java.util.regex replacement strings).
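As a rough sketch of the replacement case, again with a hypothetical simplified `simpleUrl` instead of the full regex: group 1 is unset whenever the `a href="` alternative matches, so this example uses a replacement function that puts such matches back unchanged and only rewrites bare URLs. Wrapping the bare URL in an anchor is an assumption made for illustration.

```scala
import scala.util.matching.Regex

object LinkifyBare {
  // Hypothetical, simplified stand-in for the full URL regex.
  val simpleUrl = """(?:https?://)?[a-z0-9]+(?:[.\-][a-z0-9]+)*[.][a-z]{2,6}"""

  // Left alternative: an href value, matched but not captured.
  // Right alternative: a bare URL, captured in group 1.
  val pattern: Regex =
    ("(?i)a href=\"" + simpleUrl + "|\\b(" + simpleUrl + ")").r

  def linkify(text: String): String =
    pattern.replaceAllIn(text, m => {
      val url = m.group(1)
      // quoteReplacement keeps any $ or \ in the text literal.
      if (url == null) Regex.quoteReplacement(m.matched)
      else Regex.quoteReplacement(s"""<a href="$url">$url</a>""")
    })
}
```

For example, `LinkifyBare.linkify` should leave an existing `<a href="http://www.google.com">Google</a>` untouched while wrapping a following bare `example.org` in its own anchor.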
Upvotes: 1
Reputation: 1723
How about just adding the <a href= part as an optional group, and then, when checking your matches, only returning those in which that group is empty?
Upvotes: 0