emilly

Reputation: 10530

Not getting * quantifier correctly in regex?

I am new to regex and I'm going through the regex quantifier section. I have a question about the * quantifier. Here is the definition of the * quantifier:

Based on the above definition, I wrote a small program:

public static void testQuantifier() {
    String testStr = "axbx";
    System.out.println(testStr.replaceAll("x*", "M"));
    //my expected output is MMMM but actual output is MaMMbMM
    /*
    Logic behind my expected output is:
    1. it encounters a which means 0 x is found. It should replace a with M.
    2. it encounters x which means 1 x is found. It should replace x with M.
    3. it encounters b which means 0 x is found. It should replace b with M.
    4. it encounters x which means 1 x is found. It should replace x with M.
    so output should be MMMM but why it is MaMMbMM?
    */

    System.out.println(testStr.replaceAll(".*", "M"));
    //my expected output is M but actual output is MM

    /*
    Logic behind my expected output is:
    It encounters axbx, which is any character sequence, it should 
    replace complete sequence with M.
    So output should be M but why it is MM?
    */
}

UPDATE:-

As per my revised understanding, I expect the output to be MaMMbM, not MaMMbMM, so I don't understand why I get an extra M at the end.

My revised understanding for the first regex is:

1. it encounters a which means 0 x is found. It should replace a with Ma.
2. it encounters x which means 1 x is found. It should replace x with M.
3. it encounters b which means 0 x is found. It should replace b with Mb.
4. it encounters x which means 1 x is found. It should replace x with M.
5. Lastly, it encounters the end of the string at index 4. So it replaces the 0 x found at the end of the string with M.

(Though I find it strange to consider also the index for end of string)

So the first part is clear now.

Also, if somebody could clarify the second regex, it would be helpful.

Upvotes: 2

Views: 124

Answers (2)

Tim Pietzcker

Reputation: 336078

a and b are not replaced because they are not matched by your regex. The xes and the empty strings before a non-matching letter or before the end of the string are replaced.

Let's see what happens:

  • We're at the start of the string. The regex engine tries to match an x but fails, because there is an a here.
  • The regex engine backtracks because x* also allows zero repetitions of x. We have a match and replace with M.
  • The regex engine advances past the a and successfully matches x. Replace by M.
  • The regex engine now tries to match x at the current position (after the previous match), which is right before b. It can't.
  • But it can backtrack again, matching zero xes here. Replace by M.
  • The regex engine advances past the b and successfully matches x. Replace by M.
  • The regex engine now tries to match x at the current position (after the previous match), which is at the end of the string. It can't.
  • But it can backtrack again, matching zero xes here. Replace by M.
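The steps above can be reproduced by hand. The following is only a sketch, assuming a hypothetical helper named replaceAllTraced (not part of the JDK) that mimics replaceAll with Matcher.appendReplacement and Matcher.appendTail, printing the output buffer after every match so that each replacement, including the empty ones, becomes visible:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ReplaceTrace {
    // Hypothetical helper for illustration: performs the same replacement as
    // String.replaceAll, but prints the buffer after each match is replaced.
    public static String replaceAllTraced(String regex, String input, String replacement) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            // Appends the unmatched text before this match, then the replacement.
            matcher.appendReplacement(out, Matcher.quoteReplacement(replacement));
            System.out.println("match " + matcher.start() + "-" + matcher.end() + " -> " + out);
        }
        // Appends whatever input remains after the last match.
        matcher.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(replaceAllTraced("x*", "axbx", "M")); // prints MaMMbMM
    }
}
```

The a and the b are only ever copied into the buffer as unmatched text; they are never part of a match.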

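This is implementation-dependent, by the way. In Python versions before 3.7, for example, it's

>>> re.sub("x*", "M", "axbx")
'MaMbM'

because there, empty matches for the pattern are replaced only when not adjacent to a previous match. (Python 3.7 and later also replace an empty match that directly follows a non-empty one, so they return 'MaMMbMM', like Java.)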

Upvotes: 2

Jon Skeet

Reputation: 1499770

This is where you're going wrong:

first it encounters a which means 0 x is found. So it should replace a with M.

No - it means that 0 xs are found and then an a is found. You haven't said that the a should be replaced by M... you've said that any number of xs (including 0) should be replaced by M.

If you want every character to be replaced by M, you should just use .:

System.out.println(testStr.replaceAll(".", "M"));

(I would personally have expected a result of MaMbM - I'm looking into why you get MaMMbMM instead - I suspect it's because there's a sequence of 0 xs between the x and the b, but it still seems a little odd to me.)

EDIT: It becomes a bit clearer if you look at where your pattern matches. Here's code to show that:

Pattern pattern = Pattern.compile("x*");
Matcher matcher = pattern.matcher("axbx");
while (matcher.find()) {
    System.out.println(matcher.start() + "-" + matcher.end());
}

Results (bear in mind that the end is exclusive) with a bit of explanation:

0-0 (index 0 = 'a', doesn't match)
1-2 (index 1 = 'x', matches)
2-2 (index 2 = 'b', doesn't match)
3-4 (index 3 = 'x', matches)
4-4 (index 4 is the end of the string)

If you replace each of those matches with "M", you end up with the output you're actually getting.
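The same kind of loop also seems to explain the second regex from the question. As a sketch (the spans helper is a hypothetical name introduced here for illustration), listing the matches of .* shows that the greedy pattern first consumes the whole string, and then matches once more as an empty string at the end, which is where the second M comes from:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DotStarSpans {
    // Hypothetical helper: collects "start-end" for every match of regex in input.
    public static List<String> spans(String regex, String input) {
        List<String> result = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            result.add(matcher.start() + "-" + matcher.end());
        }
        return result;
    }

    public static void main(String[] args) {
        // 0-4 is the whole string; 4-4 is the empty match at the end.
        System.out.println(spans(".*", "axbx"));          // prints [0-4, 4-4]
        System.out.println("axbx".replaceAll(".*", "M")); // prints MM
    }
}
```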

I think the fundamental problem is that if you've got a pattern which can match (in its entirety) the empty string, you can argue that that pattern occurs an infinite number of times between any two characters in the input. I would probably try to avoid such patterns where possible - make sure that any match has to include at least one character.
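As a small illustration of that advice, assuming the intent was to replace each run of x (or the whole string) exactly once, switching from * to + forces every match to contain at least one character, so no empty matches are replaced:

```java
public class NonEmptyQuantifier {
    public static void main(String[] args) {
        String testStr = "axbx";
        // x+ needs at least one x, so only the two real x runs are replaced.
        System.out.println(testStr.replaceAll("x+", "M")); // prints aMbM
        // .+ needs at least one character, so there is no empty match at the end.
        System.out.println(testStr.replaceAll(".+", "M")); // prints M
    }
}
```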

Upvotes: 6
