Reputation: 10530
I am new to regex and I'm going through the regex quantifier section. I have a question about the * quantifier. Here is the definition of the * quantifier:
X* - Finds no or several letter X.
.* - any character sequence
Based on the above definition, I wrote a small program:
public static void testQuantifier() {
    String testStr = "axbx";
    System.out.println(testStr.replaceAll("x*", "M"));
    // my expected output is MMMM but actual output is MaMMbMM
    /*
    Logic behind my expected output is:
    1. it encounters a which means 0 x is found. It should replace a with M.
    2. it encounters x which means 1 x is found. It should replace x with M.
    3. it encounters b which means 0 x is found. It should replace b with M.
    4. it encounters x which means 1 x is found. It should replace x with M.
    so output should be MMMM but why it is MaMMbMM?
    */
    System.out.println(testStr.replaceAll(".*", "M"));
    // my expected output is M but actual output is MM
    /*
    Logic behind my expected output is:
    It encounters axbx, which is any character sequence, it should
    replace complete sequence with M.
    So output should be M but why it is MM?
    */
}
UPDATE:
As per the revised understanding, I expect the output to be MaMMbM, not MaMMbMM. So I don't understand why I get an extra M at the end.
My revised understanding for the first regex is:
1. it encounters a which means 0 x is found. It should replace a with Ma.
2. it encounters x which means 1 x is found. It should replace x with M.
3. it encounters b which means 0 x is found. It should replace b with Mb.
4. it encounters x which means 1 x is found. It should replace x with M.
5. Lastly it encounters end of string at index 4. So it replaces 0x at end of String with M.
(Though I find it strange to consider also the index for end of string)
So the first part is clear now.
Also if somebody can clarify on the second regex, it would be helpful.
Upvotes: 2
Views: 124
Reputation: 336078
a and b are not replaced because they are not matched by your regex. The xes and the empty strings before a non-matching letter or before the end of the string are replaced.
Let's see what happens:
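1. At the start of the string, the regex engine tries to match x but fails, because there is an a here.
2. The engine backtracks, because x* also allows zero repetitions of x. We have a match and replace with M.
3. The engine advances past a and successfully matches x. Replace by M.
4. The engine now tries to match x at the current position (after the previous match), which is right before b. It can't.
5. But it can backtrack again and match zero xes here. Replace by M.
6. The engine advances past b and successfully matches x. Replace by M.
7. The engine now tries to match x at the current position (after the previous match), which is at the end of the string. It can't.
8. But it can backtrack again and match zero xes here. Replace by M.
This is implementation-dependent, by the way. In Python before 3.7, for example, it was
>>> re.sub("x*", "M", "axbx")
'MaMbM'
because there, empty matches for the pattern were replaced only when not adjacent to a previous match. Since Python 3.7, re.sub also replaces an empty match that is adjacent to a previous non-empty match, so the same call returns 'MaMMbMM', as in Java.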
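The walk-through earlier in this answer can be traced in Java with Matcher.appendReplacement and Matcher.appendTail, which are the building blocks that a replaceAll-style loop can be written with. This is only a sketch: the class name ReplaceTrace and the helper traceReplace are made up for illustration, and the loop approximates what replaceAll does rather than reproducing its actual source.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ReplaceTrace {
    // Hypothetical helper: replaces every match one at a time and
    // prints the span of each match plus the result built so far.
    static String traceReplace(String regex, String input, String replacement) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        StringBuffer result = new StringBuffer();
        while (matcher.find()) {
            // Appends the unmatched text before this match, then the replacement.
            matcher.appendReplacement(result, replacement);
            System.out.println("match " + matcher.start() + "-" + matcher.end()
                    + " -> " + result);
        }
        // Appends whatever is left after the last match.
        matcher.appendTail(result);
        return result.toString();
    }

    public static void main(String[] args) {
        System.out.println(traceReplace("x*", "axbx", "M")); // MaMMbMM
    }
}
```

Each empty match (before a, before b, and at the end of the string) contributes one M without consuming a character, which is why a and b survive in the output.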
Upvotes: 2
Reputation: 1499770
This is where you're going wrong:
first it encounters a which means 0 x is found. So it should replace a with M.
No - it means that 0 xs are found and then an a is found. You haven't said that the a should be replaced by M... you've said that any number of xs (including 0) should be replaced by M.
If you want every character to be replaced by M, you should just use . :
System.out.println(testStr.replaceAll(".", "M"));
(I would personally have expected a result of MaMbM - I'm looking into why you get MaMMbMM instead - I suspect it's because there's a sequence of 0 xs between the x and the b, but it still seems a little odd to me.)
EDIT: It becomes a bit clearer if you look at where your pattern matches. Here's code to show that:
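// needs: import java.util.regex.Matcher; import java.util.regex.Pattern;
Pattern pattern = Pattern.compile("x*");
Matcher matcher = pattern.matcher("axbx");
// print the start (inclusive) and end (exclusive) index of every match
while (matcher.find()) {
    System.out.println(matcher.start() + "-" + matcher.end());
}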
Results (bear in mind that the end is exclusive) with a bit of explanation:
0-0 (index 0 = 'a', doesn't match)
1-2 (index 1 = 'x', matches)
2-2 (index 2 = 'b', doesn't match)
3-4 (index 3 = 'x', matches)
4-4 (index 4 is the end of the string)
If you replace each of those matches with "M", you end up with the output you're actually getting.
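The same technique can be applied to the second pattern from the question, .* , to see why it produces MM. The following is a minimal sketch; the class name MatchSpans and the helper spans are hypothetical names introduced here, not part of the original code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatchSpans {
    // Hypothetical helper: collects "start-end" (end exclusive) for every match.
    static List<String> spans(String regex, String input) {
        List<String> spans = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            spans.add(matcher.start() + "-" + matcher.end());
        }
        return spans;
    }

    public static void main(String[] args) {
        System.out.println(spans(".*", "axbx"));          // [0-4, 4-4]
        System.out.println("axbx".replaceAll(".*", "M")); // MM
    }
}
```

The first match consumes the whole string (0-4). The matcher then resumes at index 4, where .* can still match the empty string at the end of the input, so there is a second, empty match and therefore a second M.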
I think the fundamental problem is that if you've got a pattern which can match (in its entirety) the empty string, you can argue that that pattern occurs an infinite number of times between any two characters in the input. I would probably try to avoid such patterns where possible - make sure that any match has to include at least one character.
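As a small illustration of that advice, switching from * to + forces every match to contain at least one character, so no empty matches are replaced. The class name NonEmptyPatterns and the helper replaceWithM are made up for this sketch.

```java
public class NonEmptyPatterns {
    // Hypothetical helper: replaces every match of regex in "axbx" with "M".
    static String replaceWithM(String regex) {
        return "axbx".replaceAll(regex, "M");
    }

    public static void main(String[] args) {
        System.out.println(replaceWithM("x+")); // aMbM: only the real x runs are replaced
        System.out.println(replaceWithM(".+")); // M: the whole string is one match
        System.out.println(replaceWithM("."));  // MMMM: one match per character
    }
}
```

With these patterns the output matches the asker's original expectations: .+ gives a single M, and . gives MMMM.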
Upvotes: 6