Aaron Powell

Reputation: 25107

String normalisation

I'm writing some code that needs to do string normalisation: I want to turn a given string into a camel-case representation (well, a best guess at least). For example:

"the quick brown fox" => "TheQuickBrownFox"
"the_quick_brown_fox" => "TheQuickBrownFox"
"123The_quIck bROWN FOX" => "TheQuickBrownFox"
"the_quick brown fox 123" => "TheQuickBrownFox123"
"thequickbrownfox" => "Thequickbrownfox"

I think you can get the idea from those examples. I want to strip out all special characters (', ", !, @, ., etc.), capitalise every word (words are delimited by a space, _ or -), and drop any leading numbers (trailing and internal numbers are OK; this requirement isn't vital, depending on how difficult it is).

I'm trying to work out the best way to achieve this. My first guess would be a regular expression, but my regex skills are poor at best, so I wouldn't really know where to start.

My other idea would be to loop and parse the data: break it down into words, parse each one, and rebuild the string that way.

Or is there another way in which I could go about it?

Upvotes: 0

Views: 914

Answers (5)

ben_h

Reputation: 4694

You could wear ruby slippers to work :)

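def camelize(str)
  # Drop leading non-letters, split on anything that is not a letter or digit,
  # then capitalise each word and glue the words back together.
  str.gsub(/^[^a-zA-Z]*/, '').split(/[^a-zA-Z0-9]/).map(&:capitalize).join
end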
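
As a quick sanity check, the one-liner can be run against the examples from the question. This is only a sketch: it repeats the method so it can be run on its own, and it assumes the character class is meant to be a-zA-Z.

```ruby
# Same approach as the answer, repeated so the snippet is self-contained.
def camelize(str)
  str.gsub(/^[^a-zA-Z]*/, '')   # drop leading digits/punctuation
     .split(/[^a-zA-Z0-9]/)     # words are runs of letters and digits
     .map(&:capitalize)         # "quIck" -> "Quick"
     .join
end

# Inputs and expected outputs copied from the question.
examples = {
  "the quick brown fox"     => "TheQuickBrownFox",
  "the_quick_brown_fox"     => "TheQuickBrownFox",
  "123The_quIck bROWN FOX"  => "TheQuickBrownFox",
  "the_quick brown fox 123" => "TheQuickBrownFox123",
  "thequickbrownfox"        => "Thequickbrownfox"
}

examples.each do |input, expected|
  result = camelize(input)
  status = result == expected ? "ok" : "MISMATCH"
  puts "#{input.inspect} => #{result.inspect} (#{status})"
end
```

One thing to be aware of: every non-alphanumeric character acts as a word separator here rather than simply being stripped, so "it's" comes out as "ItS". Whether that matters depends on how strictly "strip special characters" is meant.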

Upvotes: 1

thomasrutter

Reputation: 117403

Any solution that involves matching particular characters may not work well with some character encodings, particularly if Unicode is being used, which has dozens of space characters, thousands of 'symbols', thousands of punctuation characters, thousands of 'letters', etc. It would be better, wherever possible, to use built-in Unicode-aware functions. As for what counts as a 'special character', you could decide based on Unicode categories. For instance, it would include 'Punctuation', but would it include 'Symbols'?

ToLower(), IsLetter(), etc. should be fine, since they take into account all possible letters in Unicode. Matching against spaces and dashes should probably take into account some of the dozens of space and dash characters in Unicode.
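
To make the category idea concrete, here is a rough sketch in Ruby (assuming Ruby 2.4 or later, where String#capitalize handles non-ASCII letters). The method name camelize_unicode is made up for illustration, and the choice of categories is an assumption: spaces, dashes and connector punctuation such as _ separate words, and anything that is not a letter, mark or number is dropped.

```ruby
# Hypothetical helper: the name and the category choices are assumptions.
def camelize_unicode(str)
  str.sub(/\A[^\p{L}]+/, '')                        # drop everything before the first letter
     .split(/[\s\p{Z}\p{Pd}\p{Pc}]+/)               # split on Unicode spaces, dashes, connectors
     .map { |w| w.gsub(/[^\p{L}\p{M}\p{N}]/, '') }  # strip remaining punctuation and symbols
     .reject(&:empty?)                              # a word made only of symbols disappears
     .map(&:capitalize)                             # Unicode-aware case mapping
     .join
end

puts camelize_unicode("123The_quIck bROWN FOX")      # ASCII input from the question
puts camelize_unicode("it's here!")                  # apostrophe is stripped, not a separator
puts camelize_unicode("\u00c9COLE\u2003des\u2013mines") # em space and en dash as separators
```

Here the apostrophe in "it's" is removed without splitting the word, and the em space (U+2003) and en dash (U+2013) count as separators because they fall in the Z and Pd categories. Whether Symbols should be stripped or kept is still a decision for the caller.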

Upvotes: 1

Mitch Wheat

Reputation: 300719

How about a simple solution using Strings.StrConv in the Microsoft.VisualBasic namespace? (Don't forget to add a Project Reference to Microsoft.VisualBasic):

using System;
using VB = Microsoft.VisualBasic;


namespace ConsoleApplication1
{
    class Program
    {
        static void Main(string[] args)
        {
            Console.WriteLine(VB.Strings.StrConv("QUICK BROWN", VB.VbStrConv.ProperCase, 0));
            Console.ReadLine();
        }
    }
}

Upvotes: 3

configurator

Reputation: 41660

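This regex matches all words. We then Aggregate them with a method that capitalizes the first character of each word and lower-cases the rest.

// Requires: using System.Linq; using System.Text.RegularExpressions;
// "+" rather than "*" so that the regex never produces empty matches.
Regex regex = new Regex(@"[a-zA-Z]+", RegexOptions.Compiled);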

private string CamelCase(string str)
{
    return regex.Matches(str).OfType<Match>().Aggregate("", (s, match) => s + CamelWord(match.Value));
}

private string CamelWord(string word)
{
    if (string.IsNullOrEmpty(word))
        return "";

    return char.ToUpper(word[0]) + word.Substring(1).ToLower();
}

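This method ignores numbers, by the way. To include them, you can change the regex to @"[a-zA-Z]+|[0-9]+", I suppose - but I haven't tested it. Note the + rather than *: with @"[a-zA-Z]*|[0-9]*" the first alternative succeeds with an empty match at every digit, so the second alternative is never tried and the numbers are still dropped.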

Upvotes: 1

John Boker

Reputation: 83719

Thought it'd be fun to try it; here's what I came up with:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

namespace ConsoleApplication2
{
    class Program
    {
        static void Main(string[] args)
        {
            StringBuilder sb = new StringBuilder();
            string sentence = "123The_quIck bROWN FOX1234";

            sentence = sentence.ToLower();

            char[] s = sentence.ToCharArray();

            bool atStart = true;
            char pChar = ' ';

            char[] spaces = { ' ', '_', '-' };
            char a;
            foreach (char c in s)
            {
                if (atStart && char.IsDigit(c)) continue;

                if (char.IsLetter(c))
                {
                    a = c;
                    if (spaces.Contains(pChar))
                        a = char.ToUpper(a);
                    sb.Append(a);
                    atStart = false;
                }
                else if(char.IsDigit(c))
                {
                    sb.Append(c);
                }
                pChar = c;
            }

            Console.WriteLine(sb.ToString());
            Console.ReadLine();
        }
    }
}

Upvotes: 0
