user1413824

Reputation: 669

regex negative look-ahead for exactly 3 capital letters around a char

I'm trying to write a regex that finds all the characters that have exactly 3 capital letters on both sides.

The following regex finds all the characters that have exactly 3 capital letters on the left side of the char, and 3 (or more) on the right:

'(?<![A-Z])[A-Z]{3}(.)(?=[A-Z]{3})'

When trying to limit the right side to no more than 3 capitals using the regex:

'(?<![A-Z])[A-Z]{3}(.)(?=[A-Z]{3})(?![A-Z])'

I get no results; the match seems to fail as soon as (?![A-Z]) is added to the first regex.

Can someone explain the problem and suggest a way to solve it?

Thanks.

Upvotes: 0

Views: 2135

Answers (4)

Alan Moore

Reputation: 75232

You need to put the negative lookahead inside the positive one:

(?<![A-Z])[A-Z]{3}.(?=[A-Z]{3}(?![A-Z]))
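A minimal Python sketch of the difference, assuming the standard `re` module and a made-up sample string. In the question's second pattern both lookaheads are tested at the same position, right after the captured character, so (?![A-Z]) is checked against the very capital that (?=[A-Z]{3}) has just required and can never succeed. Nesting moves the negative check to the position after the three capitals. The capture group from the question is kept here so that `findall` returns only the middle character.

```python
import re

# Hypothetical sample: 'd' and 'e' qualify, 'h' has four capitals on its right.
text = "ABCdDEF, HHHhHHHH, xyzABCeDEFg"

# First pattern from the question: 3 or more capitals allowed on the right.
loose = re.compile(r'(?<![A-Z])[A-Z]{3}(.)(?=[A-Z]{3})')
# Second pattern from the question: both lookaheads apply at the same position.
broken = re.compile(r'(?<![A-Z])[A-Z]{3}(.)(?=[A-Z]{3})(?![A-Z])')
# Fix: the negative lookahead is evaluated after the three capitals.
nested = re.compile(r'(?<![A-Z])[A-Z]{3}(.)(?=[A-Z]{3}(?![A-Z]))')

loose_hits = loose.findall(text)
broken_hits = broken.findall(text)
nested_hits = nested.findall(text)

print(loose_hits)   # ['d', 'h', 'e']
print(broken_hits)  # []
print(nested_hits)  # ['d', 'e']
```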

You can do that with the lookbehind, too:

(?<=(?<![A-Z])[A-Z]{3}).(?=[A-Z]{3}(?![A-Z]))

It doesn't violate the "fixed-length lookbehind" rule because lookarounds themselves don't consume any characters.
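As a rough illustration in Python (the sample text is invented), the all-lookaround form makes the whole match a single character, so no capture group is needed and the reported position is that of the character itself:

```python
import re

# Hypothetical sample: 'a' and 'b' share the middle "AAA"; 'h' has four capitals after it.
text = "AAAaAAAbAAA, HHHhHHHH"

pattern = re.compile(r'(?<=(?<![A-Z])[A-Z]{3}).(?=[A-Z]{3}(?![A-Z]))')

# Each match consumes only the middle character.
hits = [(m.start(), m.group(0)) for m in pattern.finditer(text)]
print(hits)  # [(3, 'a'), (7, 'b')]
```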


EDIT (about fixed-length lookbehind): Of all the flavors that support lookbehind, Python is the most inflexible. In most flavors (e.g. Perl, PHP, Ruby 1.9+) you could use:

(?<=^[A-Z]{3}|[^A-Z][A-Z]{3}).

...to match a character preceded by exactly three uppercase ASCII letters. The first alternative - ^[A-Z]{3} - starts looking three positions back, while the second - [^A-Z][A-Z]{3} - goes back exactly four positions. In Java, you can reduce that to:

(?<=(^|[^A-Z])[A-Z]{3}).

...because it does a little extra work at compile time to figure out that the maximum lookbehind length will be four positions. And in .NET and JGSoft, anything goes; if it's legal anywhere, it's legal in a lookbehind.

But in Python, a lookbehind subexpression has to match a single, fixed number of characters. If you've butted your head against that limitation a few times, you might not expect something like this to work:

(?<=(?<![A-Z])[A-Z]{3}).

At least I didn't. It's even more concise than the Java version; how can it work in Python? But it does work, in Python and in every other flavor that supports lookbehind.
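A small sketch of that contrast, assuming Python's built-in `re` module (the third-party `regex` package behaves differently) and an invented test string. The alternation has branches of width three and four, so `re` refuses to compile it, while the nested form has a fixed width of three because the inner lookbehind is zero-width:

```python
import re

# Branches of different widths (3 and 4): rejected by Python's re at compile time.
try:
    re.compile(r'(?<=^[A-Z]{3}|[^A-Z][A-Z]{3}).')
    variable_width_ok = True
except re.error as exc:
    variable_width_ok = False
    print("rejected:", exc)

# Nested lookbehind: the outer one is always exactly three characters wide.
nested = re.compile(r'(?<=(?<![A-Z])[A-Z]{3}).')

# Hypothetical input: 'd' and 'i' follow three capitals, 'e' follows only two.
found = nested.findall("ABCd ABe FGHi")
print(found)  # ['d', 'i']
```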

And no, there are no similar restrictions on lookaheads, in any flavor.

Upvotes: 1

Ja͢ck

Reputation: 173562

Since the look ahead pattern is the same as the look behind pattern, you could also use the continue anchor \G:

/(?:[A-Z]{3}|\G[A-Z]*)(.)[A-Z]{3}/

A match is returned if three capitals precede a single character or where the last match left off (optionally followed by other capitals).

Upvotes: 0

Lev Levitsky

Reputation: 65791

I'm not sure how the regexp engines should work with multiple lookahead assertions, but the one you're using may have its own opinion on that.

You could also use a single assertion, as follows:

 '(?<![A-Z])[A-Z]{3}(.)(?=[A-Z]{3}[^A-Z])'

The same with lookbehind:

 '(?<=[^A-Z][A-Z]{3})(.)(?=[A-Z]{3}[^A-Z])'

This fails to match at the very beginning and the very end of the line, because [^A-Z] has to consume an actual character there. I can't think of a proper solution, but there is a dirty trick: add a space (or some other non-capital character) at the beginning and the end of the whole line, then perform the matching.

$ echo 'ABCdDEF ABCfDEF HHHhhhHHHH AAAaAAAbAAA jjJJJJjJJJ JJJjJJJ' | sed 's/.*/ & /' | grep -oP '(?<=[^A-Z][A-Z]{3})(\S)(?=[A-Z]{3}[^A-Z])'
d
f
a
b
j

Note that I changed (.) to (\S) in the middle; change it back if you want the space to match.
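The same padding trick can be sketched in Python, assuming the `re` module and reusing the sample line from the shell command above; the variable names are only illustrative. Without the padding, the first and last candidates are lost because [^A-Z] has nothing to match at the edges of the string.

```python
import re

line = "ABCdDEF ABCfDEF HHHhhhHHHH AAAaAAAbAAA jjJJJJjJJJ JJJjJJJ"
pattern = re.compile(r'(?<=[^A-Z][A-Z]{3})(\S)(?=[A-Z]{3}[^A-Z])')

# Pad both ends with a space so [^A-Z] can match at the edges.
padded = f" {line} "

with_padding = pattern.findall(padded)
without_padding = pattern.findall(line)

print(with_padding)     # ['d', 'f', 'a', 'b', 'j']
print(without_padding)  # ['f', 'a', 'b']
```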

P.S. Are you solving The Python Challenge? :)

Upvotes: 0

Derreck Dean

Reputation: 3766

Taking out the positive lookahead worked for me.

(?<![A-Z])[A-Z]{3}(.)([A-Z]{3})(?![A-Z])

Tested against 'ABCdDEF', 'ABCfDEF', 'HHHhhhHHHH', 'jjJJjjJJJ' and 'JJJjJJJ', it matches ABCdDEF, ABCfDEF and JJJjJJJ.
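A quick Python check of this pattern against the same five strings, assuming the `re` module. One trade-off worth noting, which the answer does not mention: because the three trailing capitals are now consumed, a candidate that shares them with the previous match is skipped, as the invented string 'AAAaAAAbAAA' shows.

```python
import re

pattern = re.compile(r'(?<![A-Z])[A-Z]{3}(.)([A-Z]{3})(?![A-Z])')

samples = ['ABCdDEF', 'ABCfDEF', 'HHHhhhHHHH', 'jjJJjjJJJ', 'JJJjJJJ']
matched = [s for s in samples if pattern.search(s)]
print(matched)  # ['ABCdDEF', 'ABCfDEF', 'JJJjJJJ']

# The first match consumes the middle "AAA", so 'b' is never reported.
overlap = [m.group(1) for m in pattern.finditer('AAAaAAAbAAA')]
print(overlap)  # ['a']
```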

Upvotes: 0
