Zahid Iqbal

Reputation: 31

Java recursive/repeated regex

I am trying to replace every . (period) that lies within an alphanumeric word in a large text with the keyword XXX.

For example: I am trying to match a.b.c.d.e ...
Expected output: I am trying to match aXXXbXXXcXXXdXXXe ...

Pattern I used: (\w+)([\.]+)(\w+)
Actual result: I am trying to match aXXXb.cXXXd.e ...

How can I get the expected output via regex alone, without using any extra code/stubs?

Upvotes: 2

Views: 103

Answers (3)

scriber36

Reputation: 135

Solution 1: Match Dots Together + Use Replace Function

If possible, I'd suggest a slightly different approach to using the regex:

@Test
public void test_regex_replace() {
    var input = "I am trying to match a.b.c.d.e ...";
    var expectedOutput = "I am trying to match aXXXbXXXcXXXdXXXe ...";
    var regex = Pattern.compile("((\\w+)([\\.]+))+(\\w+)");
    var output = regex.matcher(input).replaceAll(match -> match.group().replace(".", "XXX"));
    assertEquals(expectedOutput, output);
}

Notice how I changed the pattern:
Before: (\w+)([\.]+)(\w+)
After:  ((\w+)([\.]+))+(\w+)
This way it matches whole words containing multiple dots. Note that it turns a..b into aXXXXXXb rather than aXXXb; if you want the latter, modify the lambda slightly, e.g.:

regex.matcher(input).replaceAll(match -> match.group().replaceAll("\\.+", "XXX"));

or something more performant, which replaces any run of consecutive dots with a single XXX:

@Test
public void test_regex_replace() {
    final String input = "I am trying to match a.b.c.d.e ...";
    final String expectedOutput = "I am trying to match aXXXbXXXcXXXdXXXe ...";
    final Pattern regex = Pattern.compile("(?:(\\w+)\\.+)+\\w+");
    final String output = regex.matcher(input).replaceAll(match -> {
        final String matchText = match.group();
        final int matchTextLength = matchText.length();
        final var sb = new StringBuilder();
        int lastEnd = 0;
        while (lastEnd < matchTextLength) {
            int endOfWord = lastEnd;
            while (endOfWord < matchTextLength && matchText.charAt(endOfWord) != '.') {
                endOfWord += 1;
            }
            sb.append(matchText, lastEnd, endOfWord);
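            // Skip over the run of consecutive dots that follows the word.
            int endOfDots = endOfWord;
            while (endOfDots < matchTextLength && matchText.charAt(endOfDots) == '.') {
                endOfDots += 1;
            }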
            if (endOfDots != endOfWord) {
                sb.append("XXX");
            }
            lastEnd = endOfDots;
        }
        return sb.toString();
    });
    assertEquals(expectedOutput, output);
}

Matching the whole dotted word at once avoids the problem of a character being needed as both the left and the right side of a dot. I have not measured the performance, but since it does not use any lookarounds, I expect it to perform rather well.
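To make the overlap problem concrete, here is a minimal sketch that runs the pattern from the question. The class and method names are made up for illustration, and the replacement string $1XXX$3 is an assumption about what the question used, since it only shows the pattern:

```java
import java.util.regex.Pattern;

class ConsumedNeighbourDemo {
    // Pattern from the question: word, dots, word. Each match consumes both words.
    private static final Pattern ORIGINAL = Pattern.compile("(\\w+)([\\.]+)(\\w+)");

    // Assumed replacement: keep both words and swap the dots for XXX.
    static String replaceWithOriginalPattern(String input) {
        return ORIGINAL.matcher(input).replaceAll("$1XXX$3");
    }

    public static void main(String[] args) {
        // "a.b" matches first and consumes 'b', so the search resumes at ".c.d.e".
        // The dot between 'b' and 'c' has no unconsumed word on its left and is skipped.
        System.out.println(replaceWithOriginalPattern("I am trying to match a.b.c.d.e ..."));
        // prints: I am trying to match aXXXb.cXXXd.e ...
    }
}
```

Every second dot is skipped for the same reason, which is the actual result reported in the question.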


Solution 2: Using Word Boundary

You mentioned "without using any code/stubs", so the approach above might not fit your problem; in that case you would normally need lookarounds. Other than those, the only thing I can think of is using \b (the word boundary anchor) in the regex, like so:

@Test
public void test_regex_replace() {
    final String input = "I am trying to match a.b.c.d.e ...";
    final String expectedOutput = "I am trying to match aXXXbXXXcXXXdXXXe ...";
    final String output = input.replaceAll("\\b\\.+\\b", "XXX");
    assertEquals(expectedOutput, output);
}
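A few edge cases may help when deciding whether \b is acceptable. The following is a small sketch with a hypothetical helper name, not part of the test above; it relies on \b only matching where a word character (letter, digit or underscore) meets a non-word character:

```java
class WordBoundaryDemo {
    // Same regex as in Solution 2: a run of dots with a word character on both sides.
    static String replaceInnerDots(String input) {
        return input.replaceAll("\\b\\.+\\b", "XXX");
    }

    public static void main(String[] args) {
        // Dots between word characters are replaced; the sentence-ending dot is not.
        System.out.println(replaceInnerDots("v1.2.3 is out."));  // prints: v1XXX2XXX3 is out.
        // A run of dots collapses into a single XXX because of the + quantifier.
        System.out.println(replaceInnerDots("a..b"));            // prints: aXXXb
        // A leading dot has no word character on its left, so it stays.
        System.out.println(replaceInnerDots(".hidden"));         // prints: .hidden
    }
}
```

Keep in mind that an underscore counts as a word character here, so a_.b would become a_XXXb.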

Upvotes: 0

anubhava

Reputation: 784998

You can use lookarounds:

str = str.replaceAll("(?<=[a-zA-Z0-9])\\.(?=[a-zA-Z0-9])", "XXX");

RegEx Demo

Lookaround Reference
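As a quick check of the lookaround version, here is a minimal sketch (the class and method names are illustrative only). Lookarounds are zero-width, so the characters next to a dot are inspected but not consumed, and every inner dot can match:

```java
class LookaroundDemo {
    // A single dot that has an ASCII letter or digit directly before and after it.
    static String replaceInnerDots(String input) {
        return input.replaceAll("(?<=[a-zA-Z0-9])\\.(?=[a-zA-Z0-9])", "XXX");
    }

    public static void main(String[] args) {
        System.out.println(replaceInnerDots("I am trying to match a.b.c.d.e ..."));
        // prints: I am trying to match aXXXbXXXcXXXdXXXe ...

        // Two consecutive dots are left alone: neither dot has an alphanumeric on both sides.
        System.out.println(replaceInnerDots("a..b"));  // prints: a..b
    }
}
```

Unlike \w, the character class [a-zA-Z0-9] excludes the underscore, so a_.b stays unchanged with this pattern.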

Upvotes: 1

Raman Shrivastava

Reputation: 2953

If you want to replace every . in the string, why not do something like this:

str = str.replaceAll("\\.", "XXX");

Or use the following if you don't want to change a . at the first or last index:

str = str.replaceAll("\\.", "XXX").replaceAll("^XXX", ".").replaceAll("XXX$", ".");

Upvotes: 0
