Reputation: 16891
I am looking for a way to split PascalCase strings, e.g. "MyString", into separate words - "My", "String". Another user posed the question for bash, but I want to know how to do it with general regular expressions or at least in .NET.
Bonus if you can find a way to also split (and optionally capitalize) camelCase strings: e.g. "myString" becomes "my" and "String", with the option to capitalize/lowercase either or both of the strings.
Upvotes: 20
Views: 13976
Reputation: 31
string.Concat(str.Select(x => Char.IsUpper(x) ? " " + x : x.ToString())).TrimStart(' ').Dump();
This is a far better approach than using Regex. Dump is just there to print to the console.
Upvotes: 1
Reputation: 151
public static string PascalCaseToSentence(string input)
{
    if (input == null) return "";
    string output = Regex.Replace(input, @"(?<=[A-Z])(?=[A-Z][a-z])|(?<=[^A-Z])(?=[A-Z])|(?<=[A-Za-z])(?=[^A-Za-z])", m => " " + m.Value);
    return output;
}
Based on Shimmy's answer.
Upvotes: 0
Reputation: 4876
with the aims of
I therefore created the following (non regex, verbose, but performance oriented) function
public static string ToSeparateWords(this string value)
{
    if (value == null) { return null; }
    if (value.Length <= 1) { return value; }
    char[] inChars = value.ToCharArray();
    List<int> uCWithAnyLC = new List<int>();
    int i = 0;
    while (i < inChars.Length && char.IsUpper(inChars[i])) { ++i; }
    for (; i < inChars.Length; i++)
    {
        if (char.IsUpper(inChars[i]))
        {
            uCWithAnyLC.Add(i);
            if (++i < inChars.Length && char.IsUpper(inChars[i]))
            {
                while (++i < inChars.Length)
                {
                    if (!char.IsUpper(inChars[i]))
                    {
                        uCWithAnyLC.Add(i - 1);
                        break;
                    }
                }
            }
        }
    }
    char[] outChars = new char[inChars.Length + uCWithAnyLC.Count];
    int lastIndex = 0;
    for (i = 0; i < uCWithAnyLC.Count; i++)
    {
        int currentIndex = uCWithAnyLC[i];
        Array.Copy(inChars, lastIndex, outChars, lastIndex + i, currentIndex - lastIndex);
        outChars[currentIndex + i] = ' ';
        lastIndex = currentIndex;
    }
    int lastPos = lastIndex + uCWithAnyLC.Count;
    Array.Copy(inChars, lastIndex, outChars, lastPos, outChars.Length - lastPos);
    return new string(outChars);
}
What surprised me most were the performance test results, using 1 000 000 iterations per function:
regex pattern used = "([a-z](?=[A-Z])|[A-Z](?=[A-Z][a-z]))"
test string = "TestTLAContainingCamelCase":
static regex: 13 302ms
Regex instance: 12 398ms
compiled regex: 12 663ms
brent(above): 345ms
AndyRose: 1 764ms
DanTao: 995ms
The Regex instance method was only slightly faster than the static method, even over a million iterations (and I can't see the benefit of using the RegexOptions.Compiled flag), and Dan Tao's very succinct code came within about a factor of three of my much less clear code!
Upvotes: 3
Reputation: 16984
Just to provide an alternative to the Regex and looping solutions already provided, here is an answer using LINQ which also handles camel case and acronyms:
string[] testCollection = new string[] { "AutomaticTrackingSystem", "XSLT", "aCamelCaseWord" };
foreach (string test in testCollection)
{
    // if it is not the first character and it is uppercase
    // and the previous character is not uppercase then insert a space
    var result = test.SelectMany((c, i) => i != 0 && char.IsUpper(c) && !char.IsUpper(test[i - 1]) ? new char[] { ' ', c } : new char[] { c });
    Console.WriteLine(new String(result.ToArray()));
}
The output from this is:
Automatic Tracking System
XSLT
a Camel Case Word
Upvotes: 14
Reputation: 8962
See this question: Is there a elegant way to parse a word and add spaces before capital letters? Its accepted answer covers what you want, including numbers and several uppercase letters in a row. While this sample has words starting in uppercase, it is equally valid when the first word is in lowercase.
string[] tests = {
"AutomaticTrackingSystem",
"XMLEditor",
"AnXMLAndXSLT2.0Tool",
};
Regex r = new Regex(
@"(?<=[A-Z])(?=[A-Z][a-z])|(?<=[^A-Z])(?=[A-Z])|(?<=[A-Za-z])(?=[^A-Za-z])"
);
foreach (string s in tests)
    Console.WriteLine("[" + string.Join("][", r.Split(s)) + "]");
The above will output:
[Automatic][Tracking][System]
[XML][Editor]
[An][XML][And][XSLT][2.0][Tool]
Upvotes: 29
Reputation: 34347
Require a non-word character at the start of the match with \W and keep each run of capitalized words together as a single match, then split the words.
Something like: \W([A-Z][A-Za-z]+)+
For: sdcsds sd aCamelCaseWord as dasd as aSscdcacdcdc PascelCase DfsadSsdd sd
Outputs:
48: PascelCase
59: DfsadSsdd
Upvotes: 0
Reputation: 16891
var regex = new Regex("([A-Z]+[^A-Z]+)");
var matches = regex.Matches("aCamelCaseWord")
.Cast<Match>()
.Select(match => match.Value);
foreach (var element in matches)
{
Console.WriteLine(element);
}
Prints
Camel
Case
Word
(As you can see, it doesn't handle camelCase - it dropped the leading "a".)
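One possible tweak, offered as a sketch rather than a definitive fix: adding a second alternative to the pattern lets it pick up a leading lowercase run as well. The pattern and the names CamelCaseMatcher and Words are my own assumptions for illustration, not part of the snippet above. I have also relaxed the trailing [^A-Z]+ to [^A-Z]* so that an all-uppercase input such as "XSLT" still produces a match.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class CamelCaseMatcher // hypothetical name, for illustration only
{
    public static string[] Words(string input)
    {
        // First alternative: a run of uppercase letters plus any non-uppercase characters after it.
        // Second alternative: a run with no uppercase at all (the leading "a" in "aCamelCaseWord").
        var regex = new Regex("[A-Z]+[^A-Z]*|[^A-Z]+");
        return regex.Matches(input)
            .Cast<Match>()
            .Select(match => match.Value)
            .ToArray();
    }
}
```

With this, CamelCaseMatcher.Words("aCamelCaseWord") yields "a", "Camel", "Case", "Word". Acronyms are still not separated from the word that follows them: "XMLEditor" comes back as a single element.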
Upvotes: 1
Reputation: 128337
How about:
static IEnumerable<string> SplitPascalCase(this string text)
{
    var sb = new StringBuilder();
    using (var reader = new StringReader(text))
    {
        while (reader.Peek() != -1)
        {
            char c = (char)reader.Read();
            if (char.IsUpper(c) && sb.Length > 0)
            {
                yield return sb.ToString();
                sb.Length = 0;
            }
            sb.Append(c);
        }
    }
    if (sb.Length > 0)
        yield return sb.ToString();
}
Upvotes: 5
Reputation: 16891
Answered in a different question:
void Main()
{
    "aCamelCaseWord".ToFriendlyCase().Dump();
}
public static class Extensions
{
    public static string ToFriendlyCase(this string PascalString)
    {
        return Regex.Replace(PascalString, "(?!^)([A-Z])", " $1");
    }
}
Outputs "a Camel Case Word" (.Dump() just prints to the console).
Upvotes: 9
Reputation: 58780
In Ruby:
"aCamelCaseWord".split /(?=[[:upper:]])/
=> ["a", "Camel", "Case", "Word"]
I'm using positive lookahead here, so that I can split the string right before each uppercase letter. This lets me save any initial lowercase part as well.
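Since the question asks about .NET, the same lookahead idea could be sketched in C# roughly as follows. This is only a sketch under my own assumptions: the names CamelCaseSplitter, SplitWords and the capitalize parameter are hypothetical, \p{Lu} stands in for Ruby's [[:upper:]], and the (?<!^) guard is added so that a PascalCase input does not produce an empty first element.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class CamelCaseSplitter // hypothetical helper, for illustration only
{
    // Splits before every uppercase letter, except one at the very start of the string.
    // When capitalize is true, the first character of each word is uppercased.
    public static string[] SplitWords(string input, bool capitalize = false)
    {
        if (string.IsNullOrEmpty(input)) return new string[0];

        // Zero-width split: nothing is consumed, so every character is kept.
        string[] words = Regex.Split(input, @"(?<!^)(?=\p{Lu})");
        if (!capitalize) return words;

        return words
            .Select(w => char.ToUpperInvariant(w[0]) + w.Substring(1))
            .ToArray();
    }
}
```

Here SplitWords("aCamelCaseWord") gives "a", "Camel", "Case", "Word", and passing capitalize: true turns the leading "a" into "A". Like the Ruby version, it splits before every uppercase letter, so an acronym such as "XSLT" is broken into single letters.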
Upvotes: 0