ak_charlie

Reputation: 53

Java regex to remove duplicate substrings from string

I'm trying to build a regex to "reduce" duplicate consecutive substrings from a string in Java. For example, for the following input:

The big black dog big black dog is a friendly friendly dog who lives nearby nearby.

I'd like to get the following output:

The big black dog is a friendly dog who lives nearby.

This is the code I have so far:

String input = "The big black dog big black dog is a friendly friendly dog who lives nearby nearby.";

Pattern dupPattern = Pattern.compile("((\\b\\w+\\b\\s)+)\\1+", Pattern.CASE_INSENSITIVE);
Matcher matcher = dupPattern.matcher(input);

while (matcher.find()) {
    input = input.replace(matcher.group(), matcher.group(1));
}

This works for every duplicate substring except the one at the end of the sentence:

The big black dog is a friendly dog who lives nearby nearby.

I understand that my regex requires whitespace after each word in the substring, so it won't catch cases where a word is followed by a period instead of a space. I can't find a workaround for this. I have tried rearranging the capture groups and also changing the regex to accept either whitespace or a period instead of just whitespace, but that only works if there is a period after each duplicate part of the substring ("nearby.nearby.").

Can somebody point me in the right direction? Ideally the inputs for this method will be short paragraphs and not just one-liners.

Upvotes: 5

Views: 2737

Answers (2)

Eugene

Reputation: 11075

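Combining @Thomas Ayoub's answer with @Matt's comment gives the following, which anchors the pattern at a word boundary and repeats the replacement until the string stops changing.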

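public class Test2 {
    public static void main(String[] args) {
        String input = "The big big black dog big black dog is a friendly friendly dog who lives nearby nearby.";
        // \b requires the repeated run to start at a word boundary.
        String regex = "\\b([ \\w]+)\\1";
        String result = input.replaceAll(regex, "$1");
        // Repeat until a pass changes nothing, so nested repeats
        // ("big big black dog big black dog") are collapsed as well.
        while (!input.equals(result)) {
            input = result;
            result = input.replaceAll(regex, "$1");
        }
        System.out.println(result);
    }
}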
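A possible variation on the loop above, sketched under the assumption that repeats should only be collapsed at whole-word granularity. The pattern here is my own and not taken from either answer: it captures one or more whitespace-separated words and requires the repeat to end on a word boundary, so a trailing period does not block the match. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class WordRepeatSketch {
    // Hypothetical pattern: group 1 is a run of whole words. It must be followed
    // by whitespace and the same run again, and the repeat must end on a word boundary.
    private static final Pattern REPEAT = Pattern.compile(
            "\\b(\\w+(?:\\s+\\w+)*)(?:\\s+\\1\\b)+", Pattern.CASE_INSENSITIVE);

    static String collapseRepeats(String text) {
        String previous;
        // Keep replacing until a pass leaves the text unchanged,
        // so repeats nested inside other repeats are handled too.
        do {
            previous = text;
            text = REPEAT.matcher(text).replaceAll("$1");
        } while (!text.equals(previous));
        return text;
    }

    public static void main(String[] args) {
        String input = "The big big black dog big black dog is a friendly friendly dog who lives nearby nearby.";
        // prints: The big black dog is a friendly dog who lives nearby.
        System.out.println(collapseRepeats(input));
    }
}
```

Because the repeat has to be separated by whitespace and end on a word boundary, inputs such as "the theme" should be left alone. On long paragraphs the nested quantifiers may backtrack heavily, so it is worth timing this on realistic input before relying on it.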

Upvotes: 2

Thomas Ayoub

Reputation: 29431

You can use

input = input.replaceAll("([ \\w]+)\\1", "$1");

See live demo:

import java.io.*;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class Ideone
{
    public static void main (String[] args) throws java.lang.Exception
    {
        String input = "The big black dog big black dog is a friendly friendly dog who lives nearby nearby.";

        Pattern dupPattern = Pattern.compile("([ \\w]+)\\1", Pattern.CASE_INSENSITIVE);
        Matcher matcher = dupPattern.matcher(input);

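        // The matcher scans the original string; each hit re-runs the
        // replacement on the current value of input. Note that this
        // replaceAll call is case-sensitive, unlike dupPattern.
        while (matcher.find()) {
            input = input.replaceAll("([ \\w]+)\\1", "$1");
        }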
        System.out.println(input);

    }
}
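One caveat worth checking before using this pattern on real paragraphs: because `[ \w]+` can match a single character, the backreference also fires on doubled letters inside a word. A small demonstration, with a hypothetical class name, applying the same pattern once:

```java
class DoubledLetterDemo {
    static String collapse(String text) {
        // Same pattern as in the answer above, applied in a single pass.
        return text.replaceAll("([ \\w]+)\\1", "$1");
    }

    public static void main(String[] args) {
        System.out.println(collapse("hello"));        // prints: helo
        System.out.println(collapse("a little dog")); // prints: a litle dog
    }
}
```

If the input can contain words like these, a pattern that only compares whole words is probably the safer choice.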

Upvotes: 3
