Jackson Henley

Reputation: 1531

Why won't my regex lookbehind work on a URL using Ruby 1.9?

I would like to have this regex:

.match(/wtflungcancer.com\/\S*(?<!js)/i)

NOT match the following string based on the fact that 'js' is present. However, the following matches the entire URL:

"http://www.wtflungcancer.com/wp-content/plugins/contact-form-7/includes/js/jquery.form.min.js?ver=3.32.0-2013.04.03".match(/wtflungcancer.com\/\S*(?<!js)/i)

Upvotes: 1

Views: 79

Answers (2)

Casimir et Hippolyte

Reputation: 89557

You can try with this pattern:

wtflungcancer.com\/(?>[^\s.]++|\.++(?!js))*(?!\.)

Explanations:

The goal is to allow any character that is not whitespace, and to allow a dot only when it is not followed by js:

(?>                # open an atomic group
    [^\s.]++       # any characters except whitespace and .
  |                # OR
    \.++(?!js)     # . not followed by js
)*                 # close the atomic group, repeat zero or more times

To make sure the pattern checks the whole URL string, I added a lookahead that checks that no dot follows.

Upvotes: 1

Ju Liu

Reputation: 3999

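This happens because \S* consumes every non-whitespace character up to the end of the URL, so the lookbehind is only checked at that final position. The two characters before it are "03", not "js", so the negative lookbehind succeeds and the whole URL matches.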
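To see this concretely, here is a minimal sketch that runs the pattern from the question against the URL from the question and prints what was matched. The helper name `matched_text` is hypothetical and only there for illustration.

```ruby
# URL and pattern copied from the question.
JS_URL = "http://www.wtflungcancer.com/wp-content/plugins/contact-form-7/includes/js/jquery.form.min.js?ver=3.32.0-2013.04.03"
ORIGINAL_PATTERN = /wtflungcancer.com\/\S*(?<!js)/i

# Hypothetical helper: returns the matched text, or nil when nothing matches.
def matched_text(pattern, string)
  m = string.match(pattern)
  m && m[0]
end

text = matched_text(ORIGINAL_PATTERN, JS_URL)
puts text         # everything from "wtflungcancer.com/" to the end of the URL
puts text[-2, 2]  # the two characters the lookbehind inspects: "03"
```

Since the lookbehind only looks at the characters immediately before the position where \S* stopped, the "js" earlier in the URL never comes into play.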

Something like this should work:

/wtflungcancer.com(?!\S*\.js)/i

Basically

  • do not let the * consume all characters
  • instead of using a lookbehind, use a lookahead
  • search for strings containing wtflungcancer.com NOT followed by a string containing ".js"
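As a quick check of the points above, this sketch applies the lookahead pattern to the URL from the question and to a second URL. `PAGE_URL` and the helper `allowed?` are made-up names and values used only for illustration.

```ruby
# Lookahead pattern from this answer.
LOOKAHEAD_PATTERN = /wtflungcancer.com(?!\S*\.js)/i

# URL from the question (contains ".js").
SCRIPT_URL = "http://www.wtflungcancer.com/wp-content/plugins/contact-form-7/includes/js/jquery.form.min.js?ver=3.32.0-2013.04.03"
# Hypothetical URL without ".js", assumed for the example.
PAGE_URL = "http://www.wtflungcancer.com/about"

# Hypothetical helper: true when the pattern matches the URL.
def allowed?(url)
  !(url =~ LOOKAHEAD_PATTERN).nil?
end

puts allowed?(SCRIPT_URL)  # false: ".js" appears after the domain
puts allowed?(PAGE_URL)    # true: no ".js" after the domain
```

Note that the lookahead rejects any URL with ".js" anywhere after the domain, so paths such as ".json" or ".jsp" would be rejected as well. If that is too broad, the lookahead would need to be tightened.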

-- EDIT: more explanation added --

What is the difference between

"wtflungcancer.com\S*(?<!\.js)"

and

"wtflungcancer.com(?!\S*\.js)"

They look really similar!

Lookarounds (lookahead and lookbehind) in regular expressions tell the regexp engine whether a match is valid at a given position; they do not consume characters of the string.

Lookbehinds in particular tell the regexp engine to look backwards from the current position. In your case the lookbehind wasn't anchored on the right side, so the "\S*" simply consumed all the non-whitespace characters in the string.

For example, this regexp can work for finding URLs that do NOT end with ".js":

wtflungcancer.com\S+(?<!\.js)$

See? The right side of the lookbehind is anchored with the $ anchor. (In Ruby, $ matches at the end of a line; \z matches only at the very end of the string.)
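A small sketch of the anchored lookbehind in action follows. The two short URLs and the helper `not_ending_in_js?` are hypothetical and chosen only to illustrate the behaviour.

```ruby
# Anchored lookbehind pattern from this answer.
ANCHORED_PATTERN = /wtflungcancer.com\S+(?<!\.js)$/

# URL from the question; it ends in a query string, not in ".js".
QUESTION_URL = "http://www.wtflungcancer.com/wp-content/plugins/contact-form-7/includes/js/jquery.form.min.js?ver=3.32.0-2013.04.03"

# Hypothetical helper: true when the URL does not end with ".js".
def not_ending_in_js?(url)
  !(url =~ ANCHORED_PATTERN).nil?
end

puts not_ending_in_js?("http://www.wtflungcancer.com/about.html")  # true
puts not_ending_in_js?("http://www.wtflungcancer.com/js/app.js")   # false
puts not_ending_in_js?(QUESTION_URL)                               # true
```

The last line shows why this variant alone does not solve the original problem: the URL in the question ends with "?ver=3.32.0-2013.04.03", so a check on the end of the string never sees the ".js".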

In our case, though, we couldn't hook anything to the right side, so I switched from a lookbehind to a lookahead.

So the actual regular expression only matches "wtflungcancer.com". At that point, the lookahead tells the regexp engine: "For this match to be valid, it must not be followed by a sequence of non-whitespace characters followed by '.js'". This works because lookaheads do not consume characters; they only inspect the text ahead of the current position to decide whether the match is allowed.

Upvotes: 2
