Pavel Kovalev

Reputation: 8116

How to find repeatable characters

I can't understand how to solve the following problem:

I have the input string "aaaabaa" and I want to find the position of every occurrence of "aa", including overlapping ones. The expected result is 0 1 2 5.

I have already solved this with a non-regex approach, but I need a regex solution. I'm new to regular expressions, so searching Google hasn't really helped. Any help is appreciated. Thanks!

P.S. I've tried (aa)* and "\b(\w+(aa))*\w+", but neither expression works.

Upvotes: 2

Views: 268

Answers (3)

jessehouwing

Reputation: 114641

The following code should work with any regular expression without having to change the actual expression:

        // Verbatim string (@"...") so that \1 reaches the regex engine as a backreference.
        Regex rx = new Regex(@"(a)\1"); // or any other pattern you're looking for.

        int position = 0;
        string text = "aaaaabbbbccccaaa";
        int textLength = text.Length;

        Match m = rx.Match(text, position);

        while (m != null && m.Success)
        {
            Console.WriteLine(m.Index);

            if (m.Index < textLength) // the start index passed to Match must not exceed text.Length
            {
                m = rx.Match(text, m.Index + 1);
            }
            else
            {
                m = null;
            }
        }

        Console.ReadKey();

It uses the overload of Match that takes a start index, and moves that index one character past the start of the previous match for each consecutive search. The underlying problem is that the regex engine, by default, continues searching after the end of the previous match. So it will never find a match that overlaps another match, unless you tell it to by using a lookahead or by setting the start index manually.

Another, relatively easy, solution is to wrap the whole expression in a positive lookahead:

        string expression = @"(a)\1";
        Regex rx2 = new Regex("(?=" + expression + ")");
        MatchCollection ms = rx2.Matches(text);
        var indexes = ms.Cast<Match>().Select(match => match.Index);

Because a lookahead consumes no characters, every match is empty, and the engine automatically advances by one character after each match it finds.
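Applied to the input from the question, the lookahead wrapper can be sketched as below. This is only an illustration: the class name OverlapDemo and the helper OverlappingIndexes are hypothetical names, not part of any library.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class OverlapDemo
{
    // Hypothetical helper: start index of every match, overlapping ones included.
    public static List<int> OverlappingIndexes(string text, string pattern)
    {
        // Wrapping the pattern in (?=...) makes every match zero-width,
        // so the engine retries at the very next character.
        var rx = new Regex("(?=" + pattern + ")");
        return rx.Matches(text).Cast<Match>().Select(m => m.Index).ToList();
    }

    public static void Main()
    {
        // Prints: 0 1 2 5
        Console.WriteLine(string.Join(" ", OverlappingIndexes("aaaabaa", "aa")));
    }
}
```

Note that every match found this way has length 0, so only the Index is useful; the matched text itself would have to be captured by a group inside the lookahead.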

From the docs:

When a match attempt is repeated by calling the NextMatch method, the regular expression engine gives empty matches special treatment. Usually, NextMatch begins the search for the next match exactly where the previous match left off. However, after an empty match, the NextMatch method advances by one character before trying the next match. This behavior guarantees that the regular expression engine will progress through the string. Otherwise, because an empty match does not result in any forward movement, the next match would start in exactly the same place as the previous match, and it would match the same empty string repeatedly.
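The empty-match rule quoted above can be observed directly with an empty pattern, which matches the empty string at every position. A minimal sketch (the class and method names are made up for the illustration):

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class EmptyMatchDemo
{
    // Hypothetical helper: indexes of all matches of the empty pattern.
    public static int[] EmptyMatchIndexes(string text)
    {
        // Every match is empty, so the engine steps forward one character
        // after each one instead of matching at the same place forever.
        return Regex.Matches(text, "").Cast<Match>().Select(m => m.Index).ToArray();
    }

    public static void Main()
    {
        // Prints: 0 1 2 3 (the last empty match sits at the end of the string)
        Console.WriteLine(string.Join(" ", EmptyMatchIndexes("abc")));
    }
}
```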

Upvotes: 0

stema

Reputation: 92976

You can solve this by using a lookahead:

a(?=a)

will find every "a" that is followed by another "a".

If you want to do this more generally, use

(\p{L})(?=\1)

This will find every letter that is followed by the same letter. Each matched letter is stored in a capturing group (because of the parentheses around it), and the positive lookahead assertion (the (?=...)) then reuses that group through the backreference \1, which holds the matched character.

\p{L} matches any Unicode code point in the category "letter".

Code

String text = "aaaabaa";
Regex reg = new Regex(@"(\p{L})(?=\1)");

MatchCollection result = reg.Matches(text);

foreach (Match item in result) {
    Console.WriteLine(item.Index);
}

Output

0
1
2
5

Upvotes: 2

Matthew

Reputation: 5212

Try this:

How can I find repeated characters with a regex in Java?

It is for Java, but it covers both the regex and the non-regex approach. C# regex syntax is very similar to Java's.

Upvotes: -1
