Reputation: 103
In InDesign, I’m using the GREP expression (?<=.)/(?=.) to locate all occurrences of the slash character / throughout a document.
For example, I want to find the character / in Color/Colour or American English/British English in order to apply a certain styling to the slash.
However, I want to limit this to all words/strings that do not begin with either http or www, so the slashes in https://usa.gov/about or www.gov.uk/about should not be included in the results. Lone slashes should/can be ignored.
I have managed to find all words/strings that begin with either http or www with \<www|\<http. However, I’m not able to combine the two.
I’ve tried the following but with no success:
(?<=.)(?<!\<www|\<http)/(?=.)
From what I can see, InDesign uses the Boost regex library with its Perl Regular Expression Syntax.
Upvotes: 2
Views: 75
Reputation: 163577
If the regex engine is Boost, as you state in your comment, you could make use of the (*SKIP)(*FAIL) backtracking control verbs to first match what you don't want and then skip that match:
(?<!\S)(?:(?:https?|www)\S+|/+(?!\S))(*SKIP)(*F)|/
The pattern matches:
(?<!\S)  Assert a whitespace boundary to the left
(?:  Non capture group for the alternatives
(?:  Non capture group
https?  Match http or https
|  Or
www  Match literally
)  Close the non capture group (You might append \b here for a word boundary)
\S+  Match 1+ non whitespace characters
|  Or
/+  Match 1 or more times /
(?!\S)  Assert a whitespace boundary to the right
)  Close the non capture group
(*SKIP)(*F)  Skip the match
|  Or
/  Match /
See a regex demo.
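To experiment with the same idea outside InDesign, here is a minimal Python sketch. Python's built-in re module does not support (*SKIP)(*F), so this swaps in a common workaround: the first alternative consumes the unwanted text, and the wanted slash is captured in group 1. The function name and the sample text are made up for illustration.

```python
import re

# Left alternative: consume URL-like words and lone slashes (unwanted).
# Right alternative: capture any remaining slash in group 1 (wanted).
PATTERN = re.compile(r"(?<!\S)(?:(?:https?|www)\S+|/+(?!\S))|(/)")


def wanted_slash_positions(text):
    """Return offsets of slashes that are not in http/www words and not standing alone."""
    return [m.start(1) for m in PATTERN.finditer(text) if m.group(1) is not None]


sample = "Color/Colour, see https://usa.gov/about or www.gov.uk/about / done"
positions = wanted_slash_positions(sample)
print(positions)  # prints [5]
```

Only the slash in Color/Colour is reported. The URL-like words and the lone slash are consumed by the first alternative, so their slashes never reach the capturing group.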
Upvotes: 2
Reputation: 189789
As formulated, your attempt will ensure that the text immediately before / is not www or http. We can't see your test data, so what exactly you need isn't entirely clear; but probably something like
\b(?!(?:http|www))\w+/(?=\w+)
The word boundary \b anchors the expression to the beginning of a "word" (what exactly this means depends on your regex engine and perhaps your locale; typically something like alphabetics, numbers, and perhaps @ and underscore) and that's where we anchor the negative lookahead. We require this to be followed by one or more "word" characters, a slash, and more "word" characters.
For example, in a URL, this would still match components in the URL path (like com/ and more/ in http://example.com/more/stuff); if this is not what you actually want, perhaps edit your question to clarify in more detail what exactly you need.
Demo: https://regex101.com/r/7Y9TYY/1
If you want to extract just the slash (though that would be slightly weird, I think?) you can add capturing parentheses around it.
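As a rough check of the pattern with capturing parentheses added around the slash, here is a small Python sketch. It assumes Python's re module, whose \b and \w may not behave exactly like InDesign's engine. The helper name slash_offsets is made up for illustration.

```python
import re

# The pattern from this answer, with a capturing group around the slash.
PATTERN = re.compile(r"\b(?!(?:http|www))\w+(/)(?=\w+)")


def slash_offsets(text):
    """Return the offsets of the captured slashes in text."""
    return [m.start(1) for m in PATTERN.finditer(text)]


for text in ["Color/Colour", "www.gov.uk/about", "http://example.com/more/stuff"]:
    print(text, slash_offsets(text))
```

In this sketch the slash in Color/Colour is found at offset 5, but the slashes after uk, com, and more are found as well. The lookahead only rejects a "word" that itself starts with http or www, not a later component of the same URL.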
If your regex engine permits it, you can put everything from \b to just before the slash in a lookbehind; however, many engines do not permit variable-width lookbehinds.
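The lookbehind limitation, and the reason the attempt in the question only inspects the text directly before the slash, can be seen in a small Python sketch. Python's re module stands in for the real engine here and has no \< word-start assertion, so \b is used instead. The single variable-width lookbehind is split into two fixed-width ones, which is an adaptation and not the original expression. All names are made up for illustration.

```python
import re

# One lookbehind with alternatives of different widths is rejected by Python's re.
try:
    re.compile(r"(?<!\bwww|\bhttp)/")
    variable_width_ok = True
except re.error:
    variable_width_ok = False

# Two fixed-width lookbehinds: "not directly preceded by www or by http".
ADJACENT_ONLY = re.compile(r"(?<=.)(?<!\bwww)(?<!\bhttp)/(?=.)")


def adjacent_check_offsets(text):
    """Return offsets of slashes that pass the adjacent-text check."""
    return [m.start() for m in ADJACENT_ONLY.finditer(text)]


print(variable_width_ok)
for text in ["www/about", "www.gov.uk/about", "https://usa.gov/about"]:
    print(text, adjacent_check_offsets(text))
```

Only www/about is excluded, because there www sits directly before the slash. In the two real URLs the text before each slash is something else (uk, https:, gov), so every slash still matches.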
Upvotes: 1