toobee

Reputation: 2752

RegEx: Matching n-char long sequence of repeating character

I want to split up a text string that might look like this:

(((Hello! --> ((( and Hello!

or ########No? --> ######## and No?

The string starts with the same special character repeated n times, and I want to match the longest possible run of it.

What I have at the moment is this regex: ([^a-zA-Z0-9])\\1+([a-zA-Z].*)

For the first example, this returns ( (only once) and Hello!

and for the second, # and No?

How do I tell the regex that I want the longest possible repetition of the matching character?

I am using the regex as part of a Java program, in case this matters.

Upvotes: 0

Views: 91

Answers (1)

Wiktor Stribiżew

Reputation: 626903

I suggest a solution with two regexes: use (?s)(\\W)\\1+\\w.* to check whether the string starts with a repeated non-word character, and if it does, split with the simple (?<=\\W)(?=\\w) pattern (the position between a non-word and a word character). Otherwise, just return a list containing the whole string (as if it had not been split):

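import java.util.Arrays;
import java.util.List;
// Zero-width split point: after a non-word character, before a word character
String ptrn = "(?<=\\W)(?=\\w)";
List<String> strs = Arrays.asList("(((Hello!", "########No?", "$%^&^Hello!");
for (String str : strs) {
    // Split only if the string starts with a repeated non-word character
    if (str.matches("(?s)(\\W)\\1+\\w.*")) {
        System.out.println(Arrays.toString(str.split(ptrn)));
    } else {
        System.out.println(Arrays.asList(str));
    }
}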

See IDEONE demo

Result:

[(((, Hello!]
[########, No?]
[$%^&^Hello!]
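One thing worth checking with the split approach: the lookaround pattern matches at every position where a non-word character is followed by a word character, not only after the leading run. The sketch below uses a hypothetical input with a space in it (not one of the strings from the question) and the class and method names are made up for illustration. It compares a plain split with `split(regex, 2)`, which caps the result at two pieces.

```java
import java.util.Arrays;

public class SplitLimitDemo {
    // Same lookaround pattern as above: a zero-width position preceded by
    // a non-word character and followed by a word character.
    static final String BOUNDARY = "(?<=\\W)(?=\\w)";

    // No limit: every non-word/word boundary produces a new piece.
    static String[] splitAll(String s) {
        return s.split(BOUNDARY);
    }

    // Limit 2: at most two pieces, so only the first boundary is used.
    static String[] splitFirst(String s) {
        return s.split(BOUNDARY, 2);
    }

    public static void main(String[] args) {
        String input = "(((Hello! World"; // hypothetical input
        System.out.println(Arrays.toString(splitAll(input)));   // [(((, Hello! , World]
        System.out.println(Arrays.toString(splitFirst(input))); // [(((, Hello! World]
    }
}
```

If the text after the prefix can contain spaces or punctuation, passing the limit is probably the safer choice.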

Also, your original regex can be modified to fit the requirement like this:

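import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

String ptrn = "(?s)((\\W)\\2+)(\\w.*)";
// Compile once, outside the loop
Pattern p = Pattern.compile(ptrn);
List<String> strs = Arrays.asList("(((Hello!", "########No?", "$%^&^Hello!");
for (String str : strs) {
    Matcher m = p.matcher(str);
    if (m.matches()) {
        // Group 1: the whole run of repeated characters; Group 3: the rest
        System.out.println(Arrays.asList(m.group(1), m.group(3)));
    } else {
        System.out.println(Arrays.asList(str));
    }
}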

See another IDEONE demo

That regex matches:

  • (?s) - DOTALL inline modifier (if the string has newline characters, .* will also match them).
  • ((\\W)\\2+) - Group 1 captures the whole run: a non-word character (captured into Group 2) followed by the same character one or more times (via the backreference \\2).
  • (\\w.*) - Group 3 captures a word character followed by zero or more characters.
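To make the group numbering concrete, here is a small sketch that prints what each group holds for the first example, next to group 1 of the pattern from the question. The class name `GroupDemo` and the helper `group` are made up for this illustration. The difference is only where the repetition sits: in the original pattern `\\1+` is outside the capturing group, so group 1 keeps a single character.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupDemo {
    // Pattern from the question: the repetition \1+ sits outside group 1.
    static final Pattern ORIGINAL = Pattern.compile("([^a-zA-Z0-9])\\1+([a-zA-Z].*)");
    // Modified pattern: the outer group 1 wraps the character and its repeats.
    static final Pattern MODIFIED = Pattern.compile("(?s)((\\W)\\2+)(\\w.*)");

    // Returns the text of group n, or null when the whole input does not match.
    static String group(Pattern p, String input, int n) {
        Matcher m = p.matcher(input);
        return m.matches() ? m.group(n) : null;
    }

    public static void main(String[] args) {
        String input = "(((Hello!";
        System.out.println(group(ORIGINAL, input, 1)); // (
        System.out.println(group(MODIFIED, input, 1)); // (((
        System.out.println(group(MODIFIED, input, 2)); // (
        System.out.println(group(MODIFIED, input, 3)); // Hello!
    }
}
```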

Upvotes: 1
