Reputation: 150996

Why do variations of regular expression lookaround -- lookahead and lookbehind not work for commafication?

The following works in Ruby for commafication (inserting thousands separators into a number, so that 12345 becomes 12,345):

def r(s)
  s.gsub(/(?<=\d)(?=(\d\d\d)+\b)/, ",")
end

s = ""
1.upto(20) do |i|
  s += (i % 10).to_s
  puts r(s)
end

But I wonder why the variations r2 and r3 don't work:

def r2(s)
  s.gsub(/(?<=\d)(?=(\d\d\d)+)\b/, ",")
end

def r3(s)
  s.gsub(/(?<=\d)(?=\d\d\d)+\b/, ",")
end

Nothing is modified at all, and I would think that 1234 does match (?<=\d)(?=(\d\d\d)+)\b, so it is a bit strange. (I tried it in Perl as well, so it is not peculiar to Ruby.)

Update: The following is the output for r, while for r2 and r3 no comma is added at all:

1
12
123
1,234
12,345
123,456
1,234,567
12,345,678
123,456,789
1,234,567,890
12,345,678,901
123,456,789,012
1,234,567,890,123
12,345,678,901,234
123,456,789,012,345
1,234,567,890,123,456
12,345,678,901,234,567
123,456,789,012,345,678
1,234,567,890,123,456,789
12,345,678,901,234,567,890

Upvotes: 1

Answers (3)

JDB

Reputation: 25810

Well, in r2, your lookahead is saying that the next three characters must be digits, but then you immediately try to match a word boundary. They are mutually exclusive.

In r3, you are repeating the lookahead one or more times, but, being a lookahead, this is nonsense. You are repeating "the next three characters must be digits" over and over, but they either will be or won't be. Stating it more than once adds nothing. And you still have the problem with the word boundary.

A lookahead is like a peek function on a stack. It doesn't move the pointer forward because it doesn't consume anything. It matches a position (think of that as the space in between characters). So your lookahead is matching the position where three digits follow. But then the next statement (the \b) is matching a position where the character on the left is a word character (typically [a-zA-Z0-9_] or something like that) and the character on the right is not (whitespace, a period, etc.), or vice versa. Since the preceding lookbehind requires a digit before the position and the lookahead requires a sequence of digits after it, there can never be a word boundary at that position.
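To make the "position" idea concrete, here is a small sketch that marks with | every position in a sample string where each zero-width assertion holds. The sample string "1234" and the variable names are just my choices for illustration.

```ruby
s = "1234"

# Mark every position where each zero-width assertion holds with "|".
after_digit_before_three = s.gsub(/(?<=\d)(?=\d\d\d)/, "|")
word_boundaries          = s.gsub(/\b/, "|")
all_at_once              = s.gsub(/(?<=\d)(?=\d\d\d)\b/, "|")

puts after_digit_before_three  # 1|234
puts word_boundaries           # |1234|
puts all_at_once               # 1234 (no position satisfies all three)
```

The lookarounds only hold inside the run of digits, while \b only holds at its two ends, so the two sets of positions never overlap.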

Example

The following regex will always fail:

^(?=\d\d\d)\d\d\b

The ^ says that the match must start at the beginning of the input. The lookahead asserts that the next three characters must be digits (but does not consume them). The following expression says that the next two characters must be digits (and consumes them, moving the pointer forward), followed by a word boundary, which after a digit means the next character must not be a word character such as a digit. But this contradicts the lookahead, which required that the next three characters be digits. Thus, the match fails.
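A quick way to convince yourself is to try the pattern against a few inputs. This is only a sketch: the input list is made up, and the second pattern (which consumes all three digits before the \b) is my own addition for contrast. Note that in Ruby ^ anchors at the start of a line rather than the whole input, which makes no difference for these single-line strings.

```ruby
always_fails = /^(?=\d\d\d)\d\d\b/
can_match    = /^(?=\d\d\d)\d\d\d\b/  # consumes everything the lookahead saw

inputs = ["12", "123", "1234", "12 3", "123 456"]

failing_results = inputs.map { |input| always_fails.match?(input) }
p failing_results          # [false, false, false, false, false]

p can_match.match?("123")  # true
```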

See: http://www.regular-expressions.info/lookaround.html

Upvotes: 5

davidrac

Reputation: 10738

What r2 says in words is "match a word boundary that has a number of digits after it that is divisible by 3 and has a digit before it". This is a contradiction, since no boundary can have digits before AND after it. It would not be a boundary. Therefore, there is nothing this expression can match.

Upvotes: 1

Bergi

Reputation: 664434

r2 and r3 have the word boundary \b directly after the lookahead, not inside it. This never matches: since you also require the position to be preceded by a digit and followed by digits, it is certainly inside a word.

Btw, I'd consider the + after a lookahead as invalid. If a lookahead matches, it would match repeatedly of course. If you want the repetition of 3 digits, it must be inside the lookahead.
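For comparison, here is a sketch that runs the question's r and r2 side by side on one input. The two patterns are copied from the question; the input "1234567" is an arbitrary example. I left r3 out because I'm not sure every Ruby version accepts a quantifier on a lookahead.

```ruby
# \b inside the lookahead: the digits after the position must run up to a boundary.
def r(s)
  s.gsub(/(?<=\d)(?=(\d\d\d)+\b)/, ",")
end

# \b outside the lookahead: the boundary is tested at the current position,
# which sits between two digits, so it can never hold.
def r2(s)
  s.gsub(/(?<=\d)(?=(\d\d\d)+)\b/, ",")
end

puts r("1234567")   # 1,234,567
puts r2("1234567")  # 1234567
```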

Upvotes: 0
