Zim-Zam O'Pootertoot

Reputation: 18148

Negative lookbehind in a regex with an optional prefix

We are using the following regex to recognize URLs (derived from this gist by John Gruber). It is executed in Scala using scala.util.matching, which in turn uses java.util.regex:

(?i)\b((?:https?:(?:/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?!js)[a-z]{2,6}/)(?:[^\s()<>{}\[\]]+)(?:[^\s`!()\[\]{};:'".,<>?«»“”‘’])|(?:(?<!@)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?!js)[a-z]{2,6}\b/?(?!@)))

This version has escaped forward slashes, for Rubular:

(?i)\b(((?:https?:(?:\/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?!js)[a-z]{2,6}\/)(?:[^\s()<>{}\[\]]+)(?:[^\s`!()\[\]{};:'".,<>?«»“”‘’])|(?:(?<!@)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?!js)[a-z]{2,6}\b\/?(?!@))))

Previously the front end only sent plaintext to the back end, but it now allows users to create anchor tags for URLs. The back end therefore needs to recognize URLs except those that are already in anchor tags. I initially tried to accomplish this with a negative lookbehind, ignoring URLs with an href=" prefix:

(?i)\b((?<!href=")((?:https?: ... etc

The problem is that our URL regex is very liberal, recognizing http://www.google.com, www.google.com, and google.com. Given

 <a href="http://www.google.com">Google</a>

the negative lookbehind rejects a match starting at http://www.google.com, but the regex then still recognizes www.google.com. I'm wondering if there's a succinct way to tell the regex "ignore www.google.com and google.com if they are substrings of an ignored http(s)://www.google.com".
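
The behaviour described above can be reproduced with a much smaller pattern. The following is only a sketch: `simpleUrl` is a hypothetical, simplified stand-in for the full URL regex (optional scheme plus a dotted host name), and the object and method names are made up for illustration.

```scala
import scala.util.matching.Regex

object LookbehindDemo {
  // Hypothetical simplified stand-in for the full URL regex (assumption):
  // an optional http(s):// scheme followed by a dotted host name.
  private val simpleUrl =
    """(?:https?://)?[a-z0-9]+(?:[.\-][a-z0-9]+)*[.][a-z]{2,6}\b"""

  // The simplified pattern with the negative lookbehind from the question.
  val guarded: Regex = ("""(?i)\b(?<!href=")(""" + simpleUrl + ")").r

  // All URLs the guarded pattern still recognizes in the text.
  def matches(text: String): List[String] =
    guarded.findAllMatchIn(text).map(_.group(1)).toList
}
```

For the input `<a href="http://www.google.com">Google</a>`, `LookbehindDemo.matches` should return `List("www.google.com")`: the lookbehind blocks the match that starts at `http`, but the engine then retries at `www`, where the preceding six characters are `ttp://` rather than `href="`.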

At present I'm using a filter on the URL regex matches (code is in Scala). It also ignores URLs in link text (<a href="http://www.google.com">www.google.com</a>) by ignoring URLs with a > prefix and a </a> suffix. I'd rather stick with the filter if doing this in a regex would make an already complicated regex even more unreadable.

urlPattern.findAllMatchIn(text).toList.filter(m => {
  val start: Int = m.start(1)
  val end: Int = m.end(1)
  // URL is an attribute value: immediately preceded by href="
  val isHref: Boolean = start >= 6 &&
    text.substring(start - 6, start) == """href=""""
  // URL is link text: immediately preceded by > and followed by </a>
  // ("</a>" is four characters long, so four characters are compared)
  val isAnchor: Boolean = start >= 1 && end + 4 <= text.length &&
    text.substring(start - 1, start) == ">" &&
    text.substring(end, end + 4) == "</a>"
  !(isHref || isAnchor) && Option(m.group(1)).isDefined
})

Upvotes: 3

Views: 578

Answers (3)

vks

Reputation: 67988

<a href=\S+|\b((?:https?:(?:\/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?!js)[a-z]{2,6}\/)(?:[^\s()<>{}\[\]]+)(?:[^\s`!()\[\]{};:'".,<>?«»“”‘’])|(?:(?<!@)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?!js)[a-z]{2,6}\b\/?(?!@)))

or

<a href=(?:(?!<\/a>).)*<\/a>|\b((?:https?:(?:\/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?!js)[a-z]{2,6}\/)(?:[^\s()<>{}\[\]]+)(?:[^\s`!()\[\]{};:'".,<>?«»“”‘’])|(?:(?<!@)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?!js)[a-z]{2,6}\b\/?(?!@)))

Try this. What it essentially does is:

  1. Consumes every href link so that it cannot be matched later.

  2. Does not capture it, so it will not appear in the groups anyway.

  3. Processes the rest as before.
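
The three steps above can be sketched in Scala as follows. This is an illustration under assumptions, not the full pattern: `simpleUrl` is a hypothetical simplified stand-in for the URL regex, and `SkipAnchors` and `bareUrls` are made-up names.

```scala
import scala.util.matching.Regex

object SkipAnchors {
  // Hypothetical simplified stand-in for the full URL regex (assumption).
  private val simpleUrl =
    """(?:https?://)?[a-z0-9]+(?:[.\-][a-z0-9]+)*[.][a-z]{2,6}\b"""

  // First alternative: consume a whole anchor element without capturing it.
  // Second alternative: capture a bare URL in group 1.
  val pattern: Regex =
    ("""(?i)(?:<a href=(?:(?!</a>).)*</a>|\b(""" + simpleUrl + "))").r

  // Group 1 is null when the anchor alternative matched, so Option drops it.
  def bareUrls(text: String): List[String] =
    pattern.findAllMatchIn(text).flatMap(m => Option(m.group(1))).toList
}
```

With the text `See <a href="http://www.google.com">www.google.com</a> and example.org`, `SkipAnchors.bareUrls` should yield only `List("example.org")`, because the anchor element, including its link text, is consumed by the first alternative.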

See demo.

http://regex101.com/r/vR4fY4/17

Upvotes: 1

Unihedron

Reputation: 11051

It seems that you want to ignore not only www.google.com and google.com when they are substrings of an ignored http(s)://www.google.com, but any substring fragment of a previously ignored section. In that case, you can use a bit of code to work around this. Please see the regex:

(a href=")?(?i)\b(((?:https?:(?:\/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?!js)[a-z]{2,6}\/)(?:[^\s()<>{}\[\]]+)(?:[^\s`!()\[\]{};:'".,<>?«»“”‘’])|(?:(?<!@)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?!js)[a-z]{2,6}\b\/?(?!@))))
^^^^^^^^^^^

I'm not good at Scala, but you can probably do this:

val links = new Regex("""(a href=")?(?i)\b(((?:https?:... """, "unwanted")
val unwanted = for (o <- links findAllMatchIn text) yield o group "unwanted"

If the unwanted group of a match is null, the optional prefix was absent and the match is useful.
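
A runnable sketch of this idea, under assumptions: `simpleUrl` is a hypothetical simplified stand-in for the full URL regex, the names `OptionalPrefix` and `wanted` are made up, and the groups are addressed by index rather than by name to keep the example short.

```scala
import scala.util.matching.Regex

object OptionalPrefix {
  // Hypothetical simplified stand-in for the full URL regex (assumption).
  private val simpleUrl =
    """(?:https?://)?[a-z0-9]+(?:[.\-][a-z0-9]+)*[.][a-z]{2,6}\b"""

  // Group 1 optionally matches the href=" prefix; group 2 is the URL itself.
  val links: Regex = ("""(?i)(href=")?\b(""" + simpleUrl + ")").r

  // Keep only matches whose optional prefix group did not participate.
  def wanted(text: String): List[String] =
    links.findAllMatchIn(text)
      .filter(m => m.group(1) == null)
      .map(_.group(2))
      .toList
}
```

For `<a href="http://www.google.com">Google</a> or example.org`, `OptionalPrefix.wanted` should return `List("example.org")`. Because the prefixed match consumes the whole URL including the scheme, `www.google.com` is not matched again afterwards. URLs used as link text are not handled by this sketch.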

If you need to do a replacement, you can work around this by matching the unwanted case as a separate alternative:

a href="(?i)\b(?:(?:https?:(?:\/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?!js)[a-z]{2,6}\/)(?:[^\s()<>{}\[\]]+)(?:[^\s`!()\[\]{};:'".,<>?«»“”‘’])|(?:(?<!@)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?!js)[a-z]{2,6}\b\/?(?!@)))|((?i)\b(((?:https?:(?:\/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?!js)[a-z]{2,6}\/)(?:[^\s()<>{}\[\]]+)(?:[^\s`!()\[\]{};:'".,<>?«»“”‘’])|(?:(?<!@)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?!js)[a-z]{2,6}\b\/?(?!@)))))

The second part of the regex, after the pipe |, is wrapped in a capturing group. You can replace using this regex and refer to that first group as \1 (written $1 in java.util.regex replacement strings).

Upvotes: 1

PeterK

Reputation: 1723

How about just adding the <a href= part as an optional group and then, when checking your matches, returning only those in which that group is empty?

Upvotes: 0
