Pat

Reputation: 16891

Split a PascalCase string into separate words

I am looking for a way to split PascalCase strings, e.g. "MyString", into separate words - "My", "String". Another user posed the question for bash, but I want to know how to do it with general regular expressions or at least in .NET.

Bonus if you can find a way to also split (and optionally capitalize) camelCase strings: e.g. "myString" becomes "my" and "String", with the option to capitalize/lowercase either or both of the strings.

Upvotes: 20

Views: 13976

Answers (10)

Sooraj kumar

Reputation: 31

string.Concat(str.Select(x => Char.IsUpper(x) ? " " + x : x.ToString())).TrimStart(' ').Dump();

I find this a far better approach than using Regex. Dump is just there to print to the console.

Upvotes: 1

JEM

Reputation: 151

    public static string PascalCaseToSentence(string input)
    {
        if (input == null) return "";

        string output = Regex.Replace(input, @"(?<=[A-Z])(?=[A-Z][a-z])|(?<=[^A-Z])(?=[A-Z])|(?<=[A-Za-z])(?=[^A-Za-z])", m => " " + m.Value);
        return output;
    }

Based on Shimmy's answer.

Upvotes: 0

Brent

Reputation: 4876

I had two aims:

  • a) Create a function optimised for performance.
  • b) Apply my own take on CamelCase, in which capitalised acronyms are not separated (I fully accept this is not the standard definition of camel or Pascal case, but it is not an uncommon usage): "TestTLAContainingCamelCase" becomes "Test TLA Containing Camel Case" (TLA = Three Letter Acronym).

I therefore created the following (non-regex, verbose, but performance-oriented) function:

public static string ToSeparateWords(this string value)
{
    if (value==null){return null;}
    if(value.Length <=1){return value;}
    char[] inChars = value.ToCharArray();
    List<int> uCWithAnyLC = new List<int>();
    int i = 0;
    while (i < inChars.Length && char.IsUpper(inChars[i])) { ++i; }
    for (; i < inChars.Length; i++)
    {
        if (char.IsUpper(inChars[i]))
        {
            uCWithAnyLC.Add(i);
            if (++i < inChars.Length && char.IsUpper(inChars[i]))
            {
                while (++i < inChars.Length) 
                {
                    if (!char.IsUpper(inChars[i]))
                    {
                        uCWithAnyLC.Add(i - 1);
                        break;
                    }
                }
            }
        }
    }
    char[] outChars = new char[inChars.Length + uCWithAnyLC.Count];
    int lastIndex = 0;
    for (i=0;i<uCWithAnyLC.Count;i++)
    {
        int currentIndex = uCWithAnyLC[i];
        Array.Copy(inChars, lastIndex, outChars, lastIndex + i, currentIndex - lastIndex);
        outChars[currentIndex + i] = ' ';
        lastIndex = currentIndex;
    }
    int lastPos = lastIndex + uCWithAnyLC.Count;
    Array.Copy(inChars, lastIndex, outChars, lastPos, outChars.Length - lastPos);
    return new string(outChars);
}

What surprised me most were the performance tests, using 1 000 000 iterations per function:

regex pattern used = "([a-z](?=[A-Z])|[A-Z](?=[A-Z][a-z]))"
test string = "TestTLAContainingCamelCase":
static regex:      13 302ms
Regex instance:    12 398ms
compiled regex:    12 663ms
brent(above):         345ms
AndyRose:           1 764ms
DanTao:               995ms

The Regex instance method was only slightly faster than the static method, even over a million iterations, and I can't see the benefit of using the RegexOptions.Compiled flag. Dan Tao's very succinct code also came reasonably close to my much less clear code (995 ms against 345 ms)!
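
For anyone wanting to repeat this kind of comparison, here is a minimal timing sketch. The names SplitBenchmarkSketch, TimeMs and WithRegexInstance are hypothetical, and the "$1 " replacement string is an assumption, since the original test harness is not shown. It only illustrates the shape of such a measurement and makes no claim about the numbers you will see.

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

public static class SplitBenchmarkSketch
{
    // Runs the candidate `iterations` times and returns the elapsed milliseconds.
    public static long TimeMs(Func<string, string> candidate, string input, int iterations)
    {
        candidate(input); // warm-up call so one-time setup is not part of the measurement
        var watch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            candidate(input);
        }
        watch.Stop();
        return watch.ElapsedMilliseconds;
    }

    // The pattern quoted above, constructed once and reused (the "Regex instance" case).
    private static readonly Regex Instance =
        new Regex("([a-z](?=[A-Z])|[A-Z](?=[A-Z][a-z]))");

    // Assumption: a space is appended after each matched character.
    public static string WithRegexInstance(string value)
    {
        return Instance.Replace(value, "$1 ");
    }
}
```

Calling SplitBenchmarkSketch.TimeMs(SplitBenchmarkSketch.WithRegexInstance, "TestTLAContainingCamelCase", 1000000) times one candidate. Other candidates can be passed the same way, as long as they take and return a string.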

Upvotes: 3

Andy Rose

Reputation: 16984

Just to provide an alternative to the regex and looping solutions already provided, here is an answer using LINQ which also handles camel case and acronyms:

    string[] testCollection = new string[] { "AutomaticTrackingSystem", "XSLT", "aCamelCaseWord" };
    foreach (string test in testCollection)
    {
        // if it is not the first character and it is uppercase
        //  and the previous character is not uppercase then insert a space
        var result = test.SelectMany((c, i) => i != 0 && char.IsUpper(c) && !char.IsUpper(test[i - 1]) ? new char[] { ' ', c } : new char[] { c });
        Console.WriteLine(new String(result.ToArray()));
    }

The output from this is:

Automatic Tracking System  
XSLT  
a Camel Case Word 

Upvotes: 14

chilltemp

Reputation: 8962

See this question: Is there a elegant way to parse a word and add spaces before capital letters? Its accepted answer covers what you want, including numbers and several uppercase letters in a row. While this sample has words starting in uppercase, it is equally valid when the first word is in lowercase.

string[] tests = {
   "AutomaticTrackingSystem",
   "XMLEditor",
   "AnXMLAndXSLT2.0Tool",
};


Regex r = new Regex(
    @"(?<=[A-Z])(?=[A-Z][a-z])|(?<=[^A-Z])(?=[A-Z])|(?<=[A-Za-z])(?=[^A-Za-z])"
  );

foreach (string s in tests)
  Console.WriteLine("[" + string.Join("][", r.Split(s)) + "]");

The above will output:

[Automatic][Tracking][System]
[XML][Editor]
[An][XML][And][XSLT][2.0][Tool]
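
To make the three alternatives in that pattern easier to follow, here is the same regex spread over several lines with comments. This is only an annotated sketch: the class and method names (WordBoundarySketch, SplitWords) are made up for illustration, and RegexOptions.IgnorePatternWhitespace is used purely so the pattern can carry comments.

```csharp
using System.Text.RegularExpressions;

public static class WordBoundarySketch
{
    // Each alternative is a pair of zero-width lookarounds, so a match marks a
    // position between two characters and consumes nothing.
    private static readonly Regex Boundary = new Regex(@"
          (?<=[A-Z])(?=[A-Z][a-z])      # end of an acronym: in 'XMLEditor', between 'L' and 'E'
        | (?<=[^A-Z])(?=[A-Z])          # an uppercase letter right after a non-uppercase character
        | (?<=[A-Za-z])(?=[^A-Za-z])    # a letter followed by a digit or punctuation
        ", RegexOptions.IgnorePatternWhitespace);

    // Splits the input at every boundary position found by the pattern.
    public static string[] SplitWords(string input)
    {
        return Boundary.Split(input);
    }
}
```

For example, WordBoundarySketch.SplitWords("AnXMLAndXSLT2.0Tool") gives the six pieces An, XML, And, XSLT, 2.0 and Tool.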

Upvotes: 29

Aaron Butacov

Reputation: 34347

Start your regex with \W to check that a non-word character comes before the match, keep each PascalCase string together, then split the words.

Something like: \W([A-Z][A-Za-z]+)+

For the input: sdcsds sd aCamelCaseWord as dasd as aSscdcacdcdc PascelCase DfsadSsdd sd

Outputs:

48: PascelCase
59: DfsadSsdd

Upvotes: 0

Pat

Reputation: 16891

var regex = new Regex("([A-Z]+[^A-Z]+)");
var matches = regex.Matches("aCamelCaseWord")
    .Cast<Match>()
    .Select(match => match.Value);
foreach (var element in matches)
{
    Console.WriteLine(element);
}

Prints

Camel
Case
Word

(As you can see, it doesn't handle camelCase - it dropped the leading "a".)
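
One possible way to keep the leading lowercase part is to add an alternative that matches a run of non-uppercase characters at the start of the string. The sketch below is an assumption about how that could look, not a tested replacement. LeadingLowercaseSketch and Words are hypothetical names, and the second alternative uses * instead of + so that a trailing acronym such as "XSLT" still matches.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LeadingLowercaseSketch
{
    // First alternative: a run of non-uppercase characters at the very start ("a").
    // Second alternative: one or more uppercase letters plus any non-uppercase
    // characters that follow them.
    private static readonly Regex Word = new Regex("^[^A-Z]+|[A-Z]+[^A-Z]*");

    public static List<string> Words(string input)
    {
        var words = new List<string>();
        foreach (Match match in Word.Matches(input))
        {
            words.Add(match.Value);
        }
        return words;
    }
}
```

With "aCamelCaseWord" this yields a, Camel, Case and Word. Like the pattern above, it does not separate an acronym from the word that follows it, so "XMLEditor" comes back as a single element.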

Upvotes: 1

Dan Tao

Reputation: 128337

How about:

static IEnumerable<string> SplitPascalCase(this string text)
{
    var sb = new StringBuilder();
    using (var reader = new StringReader(text))
    {
        while (reader.Peek() != -1)
        {
            char c = (char)reader.Read();
            if (char.IsUpper(c) && sb.Length > 0)
            {
                yield return sb.ToString();
                sb.Length = 0;
            }

            sb.Append(c);
        }
    }

    if (sb.Length > 0)
        yield return sb.ToString();
}

Upvotes: 5

Pat

Reputation: 16891

Answered in a different question:

void Main()
{
    "aCamelCaseWord".ToFriendlyCase().Dump();
}

public static class Extensions
{
    public static string ToFriendlyCase(this string PascalString)
    {
        return Regex.Replace(PascalString, "(?!^)([A-Z])", " $1");
    }
}

Outputs a Camel Case Word (.Dump() just prints to the console).
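
For the bonus part of the question (choosing the casing of the resulting words), a rough sketch could build on the same pattern. The capitalizeWords parameter, the FriendlyCaseSketch class and the Recase helper are all hypothetical additions for illustration. Only the first letter of each word is changed, so acronyms keep their remaining letters as they are.

```csharp
using System.Linq;
using System.Text.RegularExpressions;

public static class FriendlyCaseSketch
{
    // capitalizeWords = true  turns "aCamelCaseWord" into "A Camel Case Word"
    // capitalizeWords = false turns "aCamelCaseWord" into "a camel case word"
    public static string ToFriendlyCase(string input, bool capitalizeWords)
    {
        // Same pattern as above: a space before every uppercase letter except at the start.
        string[] words = Regex.Replace(input, "(?!^)([A-Z])", " $1").Split(' ');
        return string.Join(" ", words.Select(w => Recase(w, capitalizeWords)));
    }

    // Changes only the first character of the word; the rest is left untouched.
    private static string Recase(string word, bool capitalize)
    {
        if (word.Length == 0) return word;
        char first = capitalize
            ? char.ToUpperInvariant(word[0])
            : char.ToLowerInvariant(word[0]);
        return first.ToString() + word.Substring(1);
    }
}
```

A per-word option (for example capitalizing only the first word) would need a slightly different signature. This version applies one choice to every word.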

Upvotes: 9

Ken Bloom

Reputation: 58780

In Ruby:

"aCamelCaseWord".split /(?=[[:upper:]])/
=> ["a", "Camel", "Case", "Word"]

I'm using positive lookahead here, so that I can split the string right before each uppercase letter. This lets me save any initial lowercase part as well.
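
Since the question asks about .NET, here is a rough C# counterpart of the same lookahead idea. It is a sketch under a few assumptions: \p{Lu} is used as an approximate stand-in for [[:upper:]], LookaheadSplitSketch is a made-up name, and the extra (?!^) is there because, as I understand it, Regex.Split would otherwise return an empty first element when the string starts with an uppercase letter.

```csharp
using System.Text.RegularExpressions;

public static class LookaheadSplitSketch
{
    public static string[] Split(string input)
    {
        // (?=\p{Lu}) is a zero-width lookahead for an uppercase letter, so the
        // split happens right before it and no character is consumed.
        // (?!^) rules out a split at the very start of the string.
        return Regex.Split(input, @"(?!^)(?=\p{Lu})");
    }
}
```

LookaheadSplitSketch.Split("aCamelCaseWord") returns a, Camel, Case and Word, and "PascalCase" returns Pascal and Case.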

Upvotes: 0
