Reputation: 1499
I would be very thankful if someone could help me create a regex that finds all * and ** entries in a String, because I have run out of ideas for how to build it. The string can contain only * and . characters.
For example here:
*..***...**...****....*....*****..**
We have 4x * and 7x **.
This is what I already have:
Pattern oneStarPattern = Pattern.compile("(^|\\.|(\\*{2})+)\\*(\\.|$)");
Pattern twoStarsPattern = Pattern.compile("(^|\\.|(\\*{2})+)\\*{2}(\\.|\\*|$)");
The output I get is 5x * and 5x **, which is wrong.
Upvotes: 1
Views: 72
Reputation: 124225
Before I explain what you did wrong and how you can correct it, here is one very simple way to solve your problem with just one regular expression. The idea is to let the regex try to match ** first and, only if that is not possible, try to match a single *. Such a regex can look like
\\*\\*|\\*
because the matcher tests the alternatives of OR (|) from left to right. So for data like
***
the matcher will first try to find a match for \\*\\*. It will succeed, so it will consume the first two asterisks:
***
^^
After this the matcher moves forward and again checks whether \\*\\* can be matched. Since this time there is only one * left, \\*\\* can't match, so the matcher tries the other alternative in the regex, which is \\*. So this time the matcher returns only one *:
***
  ^
And so on.
The code for this approach can look like:
String data = "*..***...**...****....*....*****..**";
Pattern p = Pattern.compile("\\*\\*|\\*");
Matcher m = p.matcher(data);
int tmp1 = 0, tmp2 = 0;
while (m.find()) {
    if (m.group().length() == 1) { // found *
        tmp1++;
    } else { // found **
        tmp2++;
    }
}
System.out.println(tmp1);
System.out.println(tmp2);
Output:
4
7
Now let's focus on your current regexes.
Your first regex (^|\\.|(\\*{2})+)\\*(\\.|$) accepts a single * that has the start of the string (^), a ., or one or more pairs of * before it, and a . or the end of the string ($) after it.
The strategy of accepting a * as long as it has an even number of * before it and a . or $ after it has one flaw, because in a case like
****.
 ^^^^
the part marked with ^ will also be matched (while it shouldn't be).
This is why this regex matches the data marked with ^ and #, where the part marked with # is not supposed to be there:
*..***...**...****....*....*****..**
^^ ^^^^        ####  ^^^   ^^^^^^
and you are seeing 5 matches.
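To see exactly which characters each of those 5 matches consumes, the offsets can be printed. This is only a small sketch: the class name FirstRegexMatches and the helper describeMatches are made up for illustration, and the pattern is the one from the question, unchanged.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FirstRegexMatches {
    // The original single-star pattern from the question, unchanged.
    static final Pattern ONE_STAR = Pattern.compile("(^|\\.|(\\*{2})+)\\*(\\.|$)");

    // Hypothetical helper: describes every match as "start-end:text"
    // so the consumed characters are visible.
    static List<String> describeMatches(String data) {
        List<String> found = new ArrayList<>();
        Matcher m = ONE_STAR.matcher(data);
        while (m.find()) {
            found.add(m.start() + "-" + m.end() + ":" + m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String data = "*..***...**...****....*....*****..**";
        for (String match : describeMatches(data)) {
            System.out.println(match);
        }
    }
}
```

The third line printed is 15-19:***. which is the unwanted match inside the run of four asterisks.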
Another problem is that your regex consumes the surrounding characters, so they can't be reused when looking for the next match. So in the case of
*.*.
^^
the first *. will be matched, but the . will be included in this match, which prevents the regex from using it while testing the second *. Because the second *. can't include the first . (used in the previous match), it will not be matched: its * has no ^, (\\*{2})+, or free-to-use . before it.
So in reality even the . characters aren't supposed to be included in the matches:
*..***...**...****....*....*****..**
^# ^^^#        ####  #^#   ^^^^^#
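The consumption problem described above can be checked on tiny inputs. The sketch below is only an illustration (the class ConsumedDotDemo and the helper countMatches are invented names); it counts how often the original pattern matches when two single asterisks share one separating dot, and when they are separated by two dots.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ConsumedDotDemo {
    // The original single-star pattern from the question, unchanged.
    static final Pattern ONE_STAR = Pattern.compile("(^|\\.|(\\*{2})+)\\*(\\.|$)");

    // Hypothetical helper: counts how many times find() succeeds on the input.
    static int countMatches(String data) {
        Matcher m = ONE_STAR.matcher(data);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // The only dot between the stars is consumed by the first match.
        System.out.println(countMatches("*.*."));  // prints 1
        // A second dot is left over, so the second star can still be matched.
        System.out.println(countMatches("*..*.")); // prints 2
    }
}
```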
To get rid of these problems you can use look-around mechanisms and change your regex to something like
"(?<=^|\\.)(\\*{2})*\\*(?=\\.|$)"
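This regex will find an odd-length run of *, i.e. a single * preceded by zero or more pairs ((\\*{2})*\\*), that has the start of the string or a . before it ((?<=^|\\.)) and a . or the end of the string after it ((?=\\.|$)).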
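As a quick check of the look-around version, here is a minimal sketch that counts its matches. The class name LookAroundCount and the method countSingles are assumptions made for this example; the pattern itself is the one shown above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookAroundCount {
    // Look-around version: the neighbouring . (or string boundary) is only
    // inspected, never consumed, so it stays available for the next match.
    static final Pattern ODD_RUN = Pattern.compile("(?<=^|\\.)(\\*{2})*\\*(?=\\.|$)");

    // Hypothetical helper: counts the runs of asterisks that leave a single * over.
    static int countSingles(String data) {
        Matcher m = ODD_RUN.matcher(data);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countSingles("*..***...**...****....*....*****..**")); // prints 4
        System.out.println(countSingles("*.*."));  // prints 2
        System.out.println(countSingles("****.")); // prints 0
    }
}
```

Note that each match here is the whole odd-length run (for example ***), so the number of matches, not their length, is what gives the count of single asterisks.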
Your second regex is (^|\\.|(\\*{2})+)\\*{2}(\\.|\\*|$)
This regex has similar problems to the first one. Let's see what it currently matches:
*..***...**...****....*....*****..**
  ^^^^  ^^^^ ^^^^         ^^^^   ^^^
There is something wrong with each match, because it includes the . before the pair, it includes a . or * at the end (preventing the next match from using it), and (^|\\.|(\\*{2})+)\\*{2} searches for the maximal possible even number of asterisks (because of (\\*{2})+) rather than a single pair.
This regex is a very good example of overcomplicating things. It may seem a little harder to fix than the first one, but in reality it is very simple.
You just need to use the \\*\\* regex. It will match only pairs of asterisks, return each of them, and then look for the next one. This regex is safe because an already matched ** can't be reused, so it will match
*********
11223344x
where 1, 2, 3, and 4 mark what is returned in each iteration of the match, and the * corresponding to x is not matched at all.
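Counting pairs this way could look like the sketch below. The names PairCount and countPairs are made up for the example; only the \\*\\* pattern comes from the explanation above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PairCount {
    // Plain pair pattern: every find() consumes exactly two asterisks.
    static final Pattern PAIR = Pattern.compile("\\*\\*");

    // Hypothetical helper: counts non-overlapping ** pairs in the input.
    static int countPairs(String data) {
        Matcher m = PAIR.matcher(data);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countPairs("*********")); // prints 4
        System.out.println(countPairs("*..***...**...****....*....*****..**")); // prints 7
    }
}
```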
Upvotes: 1