Jens Draser-Schieb

Reputation: 103

How to limit lookbehind to strings which do not start with certain characters?

In InDesign, I’m using the GREP expression (?<=.)/(?=.) to locate all occurrences of the slash character / throughout a document.

For example, I want to find the character / in Color/Colour or American English/British English in order to apply a certain styling to the slash.

However, I want to limit this to all words/strings that do not begin with either http or www, so the slashes in https://usa.gov/about or www.gov.uk/about should not be included in the results. Lone slashes should/can be ignored.

I have managed to find all words/strings that begin with either http or www with \<www|\<http, however, I’m not able to combine the two.

I’ve tried the following but with no success:

(?<=.)(?<!\<www|\<http)/(?=.)

From what I can see, InDesign uses the Perl regular expression syntax of the Boost libraries.

Upvotes: 2

Views: 75

Answers (2)

The fourth bird

Reputation: 163577

If the regex engine is Boost, as you state in your comment, you could use the SKIP FAIL backtracking control verbs to first match what you don't want and then skip that match:

(?<!\S)(?:(?:https?|www)\S+|/+(?!\S))(*SKIP)(*F)|/

The pattern matches:

  • (?<!\S) Assert a whitespace boundary to the left
  • (?: Non capture group for the alternatives
    • (?: Non capture group
      • https? Match http or https
      • | Or
      • www Match www literally
    • ) Close the non capture group (You might append \b here for a word boundary)
    • \S+ Match 1+ non whitespace characters
    • | Or
    • /+ Match 1 or more times /
    • (?!\S) Assert a whitespace boundary to the right
  • ) Close the non capture group
  • (*SKIP)(*F) Discard what was matched so far and fail, so the search resumes after the skipped text
  • | Or
  • / Match /

See a regex demo.
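
If you want to try the idea outside InDesign, here is a minimal sketch in Python. Note the swap: Python's built-in re module has no (*SKIP)(*F), so this emulates it by matching the unwanted tokens in one alternative and capturing the wanted slash in the other, then keeping only the captures. The function name slash_positions is made up for illustration, and nothing here says how InDesign itself behaves.

```python
import re

# First alternative: tokens we want to skip (URL-like tokens and lone slashes).
# Second alternative: a slash we actually want, captured in group 1.
PATTERN = re.compile(r"(?<!\S)(?:(?:https?|www)\S+|/+(?!\S))|(/)")


def slash_positions(text):
    """Return the offsets of slashes that are not inside skipped tokens."""
    return [
        m.start(1)
        for m in PATTERN.finditer(text)
        if m.group(1) is not None  # group 1 is empty when a token was skipped
    ]


if __name__ == "__main__":
    sample = "Color/Colour and https://usa.gov/about or www.gov.uk/about / x"
    print(slash_positions(sample))  # [5]
```

The skipped tokens are consumed by the first alternative, so the scan never gets a chance to see the slashes inside them. That is the same effect SKIP FAIL gives you in a single pattern.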

Upvotes: 2

tripleee

Reputation: 189789

As formulated, your attempt will ensure that the text immediately before / is not www or http. We can't see your test data, so what exactly you need isn't entirely clear; but probably something like

\b(?!(?:http|www))\w+/(?=\w+)

The word boundary \b anchors the expression to the beginning of a "word". What exactly counts as a word character depends on your regex engine and perhaps your locale; typically it means letters, digits, and underscore. That is where we anchor the negative lookahead. We then require one or more "word" characters, a slash, and more "word" characters.

For example, in a URL, this would still match components of the URL path (like com/ and more/ in http://example.com/more/stuff); if this is not what you actually want, perhaps edit your question to clarify in more detail what exactly you need.
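
To check that behaviour outside InDesign, here is a minimal Python sketch that runs the same pattern with the standard re module. The helper name matches is made up for illustration, and Python's notion of \b and \w may differ in details from what InDesign uses.

```python
import re

# Same pattern as above: a word that does not start with http or www,
# followed by a slash and at least one more word character.
PATTERN = re.compile(r"\b(?!(?:http|www))\w+/(?=\w+)")


def matches(text):
    """Return every 'word/' chunk the pattern finds in text."""
    return PATTERN.findall(text)


if __name__ == "__main__":
    print(matches("Color/Colour"))                   # ['Color/']
    print(matches("http://example.com/more/stuff"))  # ['com/', 'more/']
    print(matches("www.gov.uk/about"))               # ['uk/']
```

The last two calls show the caveat: the lookahead only rejects a word that itself starts with http or www, so later path components of a URL still match.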

Demo: https://regex101.com/r/7Y9TYY/1

If you want to extract just the slash (though that would be slightly weird, I think?) you can add capturing parentheses around it.
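
As a rough sketch of that, again in Python rather than InDesign, the slash can be wrapped in a capturing group and its offset read from the match object. The names SLASH_IN_WORD and slash_offsets are made up for illustration.

```python
import re

# Same pattern, with capturing parentheses around the slash (group 1).
SLASH_IN_WORD = re.compile(r"\b(?!(?:http|www))\w+(/)(?=\w+)")


def slash_offsets(text):
    """Return the offset of each captured slash."""
    return [m.start(1) for m in SLASH_IN_WORD.finditer(text)]


if __name__ == "__main__":
    print(slash_offsets("Color/Colour"))                      # [5]
    print(slash_offsets("American English/British English"))  # [16]
```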

If your regex engine permits it, you can put everything from \b to just before the slash in a lookbehind; however, many engines do not permit variable-width lookbehinds.
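
Python's standard re module is one engine with that restriction, so it makes a quick way to see the limitation. This is only a sketch: the helper name compiles is made up, and the result says nothing about whether InDesign accepts the same pattern.

```python
import re


def compiles(pattern):
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# Everything from \b up to the slash moved into a lookbehind (variable width).
VARIABLE_WIDTH = r"(?<=\b(?!(?:http|www))\w+)/(?=\w)"
# A lookbehind of exactly one character (fixed width).
FIXED_WIDTH = r"(?<=\w)/(?=\w)"

if __name__ == "__main__":
    print(compiles(VARIABLE_WIDTH))  # False: look-behind requires fixed width
    print(compiles(FIXED_WIDTH))     # True
```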

Upvotes: 1
