Not getting * quantifier correctly in regex?

Question

I am new to regex and I'm going through the regex quantifier section. I have a question about the * quantifier. Here is the definition of the * quantifier:

X* - matches zero or more occurrences of the letter X
.* - matches any sequence of characters

Based on the above definition, I wrote a small program:

public static void testQuantifier() {
    String testStr = "axbx";
    System.out.println(testStr.replaceAll("x*", "M"));
    // I expected MMMM, but the actual output is MaMMbMM.
    /*
    Reasoning behind my expected output:
    1. It encounters a, which means 0 x's are found. It should replace a with M.
    2. It encounters x, which means 1 x is found. It should replace x with M.
    3. It encounters b, which means 0 x's are found. It should replace b with M.
    4. It encounters x, which means 1 x is found. It should replace x with M.
    So the output should be MMMM. Why is it MaMMbMM?
    */

    System.out.println(testStr.replaceAll(".*", "M"));
    // I expected M, but the actual output is MM.

    /*
    Reasoning behind my expected output:
    It encounters axbx, which is a sequence of arbitrary characters, so it
    should replace the complete sequence with M.
    So the output should be M. Why is it MM?
    */
}

UPDATE:

With my revised understanding, I would expect the output to be MaMMbM rather than MaMMbMM, so at first I did not see where the extra M at the end comes from.

My revised understanding for the first regex is:

1. it encounters a which means 0 x is found. It should replace a with Ma.
2. it encounters x which means 1 x is found. It should replace x with M.
3. it encounters b which means 0 x is found. It should replace b with Mb.
4. it encounters x which means 1 x is found. It should replace x with M.
5. Lastly, it encounters the end of the string at index 4, so it replaces the 0 x's at the end of the string with M.

(Though I find it strange to consider also the index for end of string)

So the first part is clear now.

Also, if somebody could clarify the second regex, that would be helpful.

Jon Skeet · Accepted Answer

This is where you're going wrong:

first it encounters a which means 0 x is found. So it should replace a with M.

No - it means that 0 xs are found and then an a is found. You haven't said that the a should be replaced by M... you've said that any number of xs (including 0) should be replaced by M.

If you want every character to be replaced by M, you should just use .:

System.out.println(testStr.replaceAll(".", "M"));

(I would personally have expected a result of MaMbM - I'm looking into why you get MaMMbMM instead - I suspect it's because there's a sequence of 0 xs between the x and the b, but it still seems a little odd to me.)

EDIT: It becomes a bit clearer if you look at where your pattern matches. Here's code to show that:

Pattern pattern = Pattern.compile("x*");
Matcher matcher = pattern.matcher("axbx");
while (matcher.find()) {
    System.out.println(matcher.start() + "-" + matcher.end());
}

Results (bear in mind that the end is exclusive) with a bit of explanation:

0-0 (index 0 = 'a', doesn't match)
1-2 (index 1 = 'x', matches)
2-2 (index 2 = 'b', doesn't match)
3-4 (index 3 = 'x', matches)
4-4 (index 4 is the end of the string)

If you replace each of those matches with "M", you end up with the output you're actually getting.
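The same check can be applied to the second pattern from the question, .*, which may help explain the MM result. The following is only a sketch: the class name MatchSpans and the helper spans are made up for illustration and are not part of any library. It collects the start and end of every match that find() reports.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class MatchSpans {
    // Returns "start-end" for each match reported by find() (end is exclusive).
    static List<String> spans(String regex, String input) {
        List<String> result = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            result.add(matcher.start() + "-" + matcher.end());
        }
        return result;
    }

    public static void main(String[] args) {
        // .* first consumes the whole string, then matches the empty
        // string once more at the end of the input.
        System.out.println(spans(".*", "axbx"));           // [0-4, 4-4]
        System.out.println("axbx".replaceAll(".*", "M"));  // MM
    }
}
```

So .* matches twice: once for the whole of axbx (0-4) and once for the empty string at the end (4-4). Each match is replaced with M, which gives MM.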

I think the fundamental problem is that if you've got a pattern which can match (in its entirety) the empty string, you can argue that that pattern occurs an infinite number of times between any two characters in the input. I would probably try to avoid such patterns where possible - make sure that any match has to include at least one character.
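As a rough sketch of that advice applied to the question's input: swapping * for + forces every match to contain at least one character, so no empty matches are reported. The class and method names below are hypothetical and exist only for this example.

```java
// Hypothetical demo class, for illustration only.
class NonEmptyMatchDemo {
    // Replaces each run of one or more x with a single M.
    static String replaceRunsOfX(String input) {
        return input.replaceAll("x+", "M");
    }

    // Replaces the whole (non-empty) input with a single M.
    static String replaceWholeSequence(String input) {
        return input.replaceAll(".+", "M");
    }

    public static void main(String[] args) {
        System.out.println(replaceRunsOfX("axbx"));        // aMbM
        System.out.println(replaceWholeSequence("axbx"));  // M
    }
}
```

With x+ the a and b are left alone and only the x characters are replaced; with .+ the whole string is matched exactly once, because nothing is left to match after it.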
