Reputation: 57
I have a pattern to find repeated patterns in a string:
String patt = "(.+?)\\1+";
Unfortunately this matches "003003003" as "0". What would I need to modify to make this match as "003"?
Upvotes: 2
Views: 2079
Reputation:
From OP comment: But in case I have "000000" I would want "0"
modified:
Repeating group of different characters will be match first if possible.
It turns out (I believe) that finding the first different characters,
results in the largest distinct group to be repeatedly matched.
This enables the use of non-greedy quantifiers that would enable this
to be used on large strings without a performance hit.
Interpreting results:
(Note - if an engine that supports branch reset is used, all results can be
obtained from the same group)
Group 0 contains the entire match, then;
Either:
Regex: ((.)+?(?!\2).+?)\1+|(.)\3+
Expanded:
( # (1 start), Different chars (at least 2) repeating group
( . )+? # (2), Shortest distance sample character
(?! \2 ) # Border between different characters
.+? # First different, shortest distance
) # (1 end)
\1+
| # or
( . ) # (3), Same character repeating group
\3+
Input:
000010910910901012222
Output:
** Grp 0 - ( pos 0 , len 4 )
0000
** Grp 1 - NULL
** Grp 2 - NULL
** Grp 3 - ( pos 0 , len 1 )
0
-------------
** Grp 0 - ( pos 4 , len 9 )
109109109
** Grp 1 - ( pos 4 , len 3 )
109
** Grp 2 - ( pos 5 , len 1 )
0
** Grp 3 - NULL
-------------
** Grp 0 - ( pos 13 , len 4 )
0101
** Grp 1 - ( pos 13 , len 2 )
01
** Grp 2 - ( pos 13 , len 1 )
0
** Grp 3 - NULL
-------------
** Grp 0 - ( pos 17 , len 4 )
2222
** Grp 1 - NULL
** Grp 2 - NULL
** Grp 3 - ( pos 17 , len 1 )
2
Using Branch Reset: Group 0 = entire match, Group 1 = repeated group.
# (?|((.)+?(?!\2).+?)\1+|(.)\1+)
(?|
( # (1 start), Different chars (at least 2) repeating group
( . )+? # (2), Shortest distance sample character
(?! \2 ) # Border between different characters
.+? # First different, shortest distance
) # (1 end)
\1+
| # or
( . ) # (1), Same character repeating group
\1+
)
Using single lookahead for overlapped matches.
Group 1 = entire match, Group 2 = different char's repeated group, Group 4 single char repeated.
# (?=(((.)+?(?!\3).+?)\2+|(.)\4+))
(?=
( # (1 start), Capture entire match
( # (2 start), Different chars (at least 2) repeating group
( . )+? # (3), Shortest distance sample character
(?! \3 ) # Border between different characters
.+? # First different, shortest distance
) # (2 end)
\2+
| # or
( . ) # (4), Same character repeating group
\4+
) # (1 end)
)
Using Branch Rreset and single lookahead for overlapped matches.
Group 1 = entire match, Group 2 = repeated group.
# (?=((?|((.)+?(?!\3).+?)\2+|(.)\2+)))
(?=
( # (1 start), Capture entire match
(?|
( # (2 start), Different chars (at least 2) repeating group
( . )+? # (3), Shortest distance sample character
(?! \3 ) # Border between different characters
.+? # First different, shortest distance
) # (2 end)
\2+
| # or
( . ) # (2), Same character repeating group
\2+
)
) # (1 end)
)
Upvotes: 0
Reputation: 89557
A "theorical" regex to do that can be:
(?=(?<sub>(.+)(\\2+)))(?=(?<pattern>.+?)(?:\\4+\\3|\\2))
(in DOTALL mode)
This regex will find the largest repeated substring for each position in the string.
I say "theorical" because in practice you can't use it in a long string since the number of steps needed to parse the entire string grow very quickly with the size of the string. So a good compromise is to limit the size of the searched substrings, for example like this:
(?=(?<sub>(.{1,20})(\\2+)))(?=(?<pattern>.{1,10}?)(?:\\4+\\3|\\2))
pattern explanation:
The regex is divided in two lookahead assertions that test the string at the same position (since a lookakead is a zero-width assertion).
The first lookahead try to find a substring (called "sub") that is a repeated sequence with the largest pattern (group 2). That's why a greedy quantifier is used. Note that the end of the substring (\\2+)
(after the first pattern) is captured in group 3, because the second lookahead will need it.
The goal of the second lookahead is to check what is the shortest pattern for the same substring. This time, a non-greedy quantifier is used, and the lookahead succeeds when the group 3 is reached after one or more repetitions of a new smaller pattern, or when the largest pattern is reached (in this case there is no smaller patterns).
pattern details:
(?= # first lookahead
(?<sub> # group "sub" (or group 1) with the whole substring
(.+) # group 2 with the largest pattern
(\\2+) # group 3 with the end of the substring
)
)
(?= # second lookahead
(?<pattern>.+?) # the shortest pattern (group 4) ...
(?: # non-capturing group: ... must be followed by
\\4+\\3 # one or more repetitions until the group 3
| # OR
\\2 # until the largest pattern (group 2)
)
)
demo:
import java.util.regex.*;
class repeatedSubstrings
{
public static void main (String[] args)
{
String s = "0070070071071071" + "\n"
+ "Now is the winter of our discontent" + "\n"
+ "Made glorious summer by this sun of York;" + "\n"
+ "And all the clouds that lour'd upon our house" + "\n"
+ "In the deep bosom of the ocean buried." + "\n"
+ "Now are our brows bound with victorious wreaths;" + "\n"
+ "Our bruised arms hung up for monuments;" + "\n"
+ "Our stern alarums changed to merry meetings," + "\n"
+ "Our dreadful marches to delightful measures." + "\n"
+ "00000000" + "\n"
+ "42424242";
Pattern Reg = Pattern.compile("(?=(?<sub>(.+)(\\2+)))(?=(?<pattern>.+?)(?:\\4+\\3|\\2))", Pattern.DOTALL);
Matcher m = Reg.matcher(s);
int prevEnd = 0;
int currEnd;
long start = System.currentTimeMillis();
StringBuilder display = new StringBuilder("substring pattern \tstart offset\tend offset\tlength\n");
display.append("--------------------------------------------------------------------------------\n");
while (m.find()) {
currEnd = m.start() + m.group("sub").length();
if (currEnd > prevEnd) {
display.append(String.format("%-15s%-10s\t%-15d\t%-15d\t%d\n", m.group("sub"), m.group("pattern"), m.start(), currEnd, m.group("sub").length()));
prevEnd = currEnd;
}
}
long end = System.currentTimeMillis();
System.out.println(display.toString());
System.out.printf("Elapsed Time: %.1es\n", (end-start)/1000.0);
}
}
Note: this regex returns the largest substring with repeated patterns for each position in the string. So overlapping substrings are returned too. with the string 0070070071071071
the substrings are 007007007
and 071071071
. If you don't want this behaviour, you can simply add \\1
at the end of the regex to get 007007007
and 107107
.
The example code excludes results that start after a previous result and end before or at the same offset in the string. If you want to get all these results, remove the if
statement.
Upvotes: 0
Reputation: 174706
Just remove the question mark ie, make your regex as greedy.
Pattern patt = Pattern.compile("(.+)\\1+");
Why your regex returns only 0
and 300
after printing the group index 1?
Because .+?
forces the regex engine to match any character one or more times non-greedily. So this matches the first 0
and checks whether there is another one or more zeros or following. Yeh, it's there. So it captures the first zero and matches the second zero.
Now it takes the third character and do checking.
OR
Use anchors.
^(.+?)\\1+$
Upvotes: 3