Reputation: 598

Ignore existing spaces in converting CamelCase to string with spaces

I want to split camelCase or PascalCase words into a space-separated collection of words.

So far, I have:

Regex.Replace(value, @"(\B[A-Z]+?(?=[A-Z][^A-Z])|\B[A-Z]+?(?=[^A-Z]))", " $0", RegexOptions.Compiled);

It works fine for converting "TestWord" to "Test Word" and for leaving single words untouched, e.g. "Testing" remains "Testing".

However, "ABCTest" gets converted to "A B C Test" when I would prefer "ABC Test".

Upvotes: 8

Answers (3)

Wiktor Stribiżew

Reputation: 627380

Here is my attempt:

(?<!^|\b|\p{Lu})\p{Lu}+(?=\p{Ll}|\b)|(?<!^\p{Lu}*|\b)\p{Lu}(?=\p{Ll}|(?<!\p{Lu}*)\b)

This regex can be used with Regex.Replace and " $0" (a space followed by the whole match) as the replacement string.

Regex.Replace(value, @"(?<!^|\b|\p{Lu})\p{Lu}+(?=\p{Ll}|\b)|(?<!^\p{Lu}*|\b)\p{Lu}(?=\p{Ll}|(?<!\p{Lu}*)\b)", " $0", RegexOptions.Compiled);

See demo

Regex Explanation:

The pattern contains two alternatives, to account for a run of capital letters before or after lowercase letters:
(?<!^|\b|\p{Lu})\p{Lu}+(?=\p{Ll}|\b) - the first alternative matches one or more uppercase letters that are not preceded by the start of the string, a word boundary, or another uppercase letter, and that are followed by a lowercase letter or a word boundary;
(?<!^\p{Lu}*|\b)\p{Lu}(?=\p{Ll}|(?<!\p{Lu}*)\b) - the second alternative matches a single uppercase letter that is not preceded by the start of the string plus optional uppercase letters, or by a word boundary, and that is followed by a lowercase letter or by a word boundary that is not preceded by optional uppercase letters.
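For comparison, the same two cases (a lowercase-to-uppercase transition, and a capital run that ends just before a capitalised word) can also be written as zero-width boundaries. This is a minimal sketch of a different, simpler pattern, not the regex from this answer. The class and method names are made up for illustration, and it assumes that only positions between letters should receive a space, so existing spaces are left alone.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper; the names are illustrative only.
public static class CamelCaseSplitter
{
    // Matches two kinds of empty positions:
    //   1. between a lowercase letter and an uppercase letter ("tW" in "TestWord")
    //   2. between an uppercase letter and an uppercase letter that is
    //      followed by a lowercase letter ("C|Te" in "ABCTest")
    private static readonly Regex Boundary = new Regex(
        @"(?<=\p{Ll})(?=\p{Lu})|(?<=\p{Lu})(?=\p{Lu}\p{Ll})");

    public static string InsertSpaces(string value)
    {
        // Each match is zero-width, so Replace only inserts a space
        // and never removes a character.
        return Boundary.Replace(value, " ");
    }
}
```

With this sketch, CamelCaseSplitter.InsertSpaces("ABCTest") gives "ABC Test", "TestWord" gives "Test Word", and both "Testing" and "Test Word" come back unchanged, because neither lookbehind can match after a space.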

Upvotes: 1

Jon Rea

Reputation: 9455

Do you have a requirement to use Regex? To be honest, I wouldn't use Regex for this at all. Regular expressions are hard to debug and not especially readable.

You also sometimes end up with all sorts of fun like this: Regex problem: IsMatch method never returns
The regex above will not deal with the wonderful world of Unicode, e.g. Cyrillic (http://en.wikipedia.org/wiki/Cyrillic_script) (not that your specific problem domain probably needs this, but for completeness...)

I would go with a small, reusable, easily testable extension method:

using System;
using System.Globalization;
using System.Linq;
using System.Text;

class Program
{
    static void Main(string[] args)
    {
        string[] inputs = new[]
        {
            "ABCTest",
            "HelloWorld",
            "testTest$Test",
            "aaҚbb"
        };

        var output = inputs.Select(x => x.SplitWithSpaces(CultureInfo.CurrentUICulture));

        foreach (string x in output)
        {
            Console.WriteLine(x);
        }

        Console.Read();
    }
}

public static class StringExtensions
{
    public static bool IsLowerCase(this TextInfo textInfo, char input)
    {
        return textInfo.ToLower(input) == input;
    }

    public static string SplitWithSpaces(this string input, CultureInfo culture = null)
    {
        if (culture == null)
        {
            culture = CultureInfo.InvariantCulture;
        }
        TextInfo textInfo = culture.TextInfo;

        StringBuilder sb = new StringBuilder(input);

        for (int i = 1; i < sb.Length; i++)
        {
            int previous = i - 1;

            if (textInfo.IsLowerCase(sb[previous]))
            {
                int insertLocation = previous - 1;

                if (insertLocation > 0)
                {
                    sb.Insert(insertLocation, ' ');
                }

                while (i < sb.Length && textInfo.IsLowerCase(sb[i]))
                {
                    i++;
                }
            }                
        }

        return sb.ToString();
    }
}

Upvotes: 0

thodic

Reputation: 2269

Try:

[A-Z][a-z]+|[A-Z]+(?=[A-Z][a-z])|[a-z]+|[A-Z]+

An example on Regex101

How is it used in C#?

string strText = " TestWord asdfDasdf  ABCDef";
        
string[] matches = Regex.Matches(strText, @"[A-Z][a-z]+|[A-Z]+(?=[A-Z][a-z])|[a-z]+|[A-Z]+")
                .Cast<Match>()
                .Select(m => m.Value)
                .ToArray();
            
string result = String.Join(" ", matches);

result = 'Test Word asdf Dasdf ABC Def'

How it works

In the example string:

TestWord qwerDasdf
ABCTest Testing    ((*&^%$CamelCase!"£$%^^))
asdfAasdf
AaBbbCD

[A-Z][a-z]+ matches:

[0-4] Test
[4-8] Word
[13-18] Dasdf
[22-26] Test
[27-34] Testing
[45-50] Camel
[50-54] Case
[68-73] Aasdf
[74-76] Aa
[76-79] Bbb

[A-Z]+(?=[A-Z][a-z]) matches:

[19-22] ABC

[a-z]+ matches:

[9-13] qwer
[64-68] asdf

[A-Z]+ matches:

[79-81] CD
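To make the snippet above reusable, the pattern can be wrapped in a small helper. This is only a sketch: the class and method names are hypothetical, and the pattern is the one from this answer, unchanged.

```csharp
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical wrapper; the names are illustrative only.
public static class WordSplitter
{
    // Alternatives, in order: Capitalised word, capital run before a
    // Capitalised word, lowercase run, trailing capital run.
    private const string Pattern =
        @"[A-Z][a-z]+|[A-Z]+(?=[A-Z][a-z])|[a-z]+|[A-Z]+";

    public static string ToSpacedWords(string text)
    {
        // Collect every match and join the pieces with single spaces.
        var words = Regex.Matches(text, Pattern)
                         .Cast<Match>()
                         .Select(m => m.Value);
        return string.Join(" ", words);
    }
}
```

WordSplitter.ToSpacedWords("ABCTest") returns "ABC Test". Note that because the result is rebuilt from the matches alone, anything outside A-Z and a-z (digits, punctuation, non-ASCII letters, repeated spaces) is dropped: "Test123" becomes "Test".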

Upvotes: 4
