Tyllyn

Reputation: 289

String Anglicization?

Is anyone aware of a simple way to anglicize a string? Currently, our system replaces "invalid" characters one by one, as shown below:

        ret = ret.Replace("ä", "ae");
        ret = ret.Replace("Ä", "Ae");
        ret = ret.Replace("ß", "ss");
        ret = ret.Replace("ç", "c");
        ret = ret.Replace("Ç", "C");
        ret = ret.Replace("Ž", "Z");

The issue here is that as we're opening the business up in additional countries (Turkey, Russia, Hungary...), we're finding that there's a whole slew of characters that this process does not convert.

Is anyone aware of any sort of solution that would allow us to not depend on a table of "invalid" characters?

Also, if it helps, we're using C# to code. :)

Thanks!


edit:

In response to some comments: our system does support the full set of Unicode characters. However, other systems that we integrate with (such as card processors) do not. :(

Upvotes: 2

Views: 505

Answers (4)

revl

Reputation: 161

I apologize for the shameless plug, but I couldn't resist. I once wrote a Python module that does exactly what the author of the post needs:

https://github.com/revl/anglicize

Because Python is almost as readable as pseudocode and the module is only about 125 lines long, it's relatively easy to rewrite it in C#.

Here's what the module produces given the input from the original post:

$ echo 'ä Ä ß ç Ç Ž' | anglicize
a A ss c C Z

As you can see, "ß" was replaced with "ss" as requested, and "ç", "Ç", and "Ž" were replaced with "c", "C", and "Z" respectively, matching the replacements in the question.

As for "ä" and "Ä", the transliterations "ae" and "Ae" would probably work better than "a" and "A". I will gladly change the transliteration table if the linguists out there confirm that that's the right thing to do.

The module can transliterate the whole input text at once, or it can process input data in chunks. The documentation is in the README file that comes with the module.

Upvotes: 1

Amnon

Reputation: 7772

As an answer to the modified problem (mail server supports only alphanumeric characters in usernames):

Let the users choose their own usernames, allowing only alphanumeric characters. They probably know best how to "anglicize" it.
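
If you go this route, the check itself is small. Here is a minimal sketch in C#; the class and method names (UsernameRules, IsValid) are made up for illustration. Note that char.IsLetterOrDigit would not do the job here, since it also accepts non-ASCII letters such as "ä", so the sketch uses an explicit ASCII range instead.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper: accepts only non-empty strings made of ASCII letters and digits.
public static class UsernameRules
{
    // \z anchors at the very end of the input; "$" would also match before a trailing newline.
    private static readonly Regex AsciiAlphanumeric =
        new Regex(@"^[A-Za-z0-9]+\z", RegexOptions.CultureInvariant);

    public static bool IsValid(string candidate)
    {
        return candidate != null && AsciiAlphanumeric.IsMatch(candidate);
    }
}
```

With this, UsernameRules.IsValid("tyllyn289") passes, while a name containing "ë" or a space is rejected and the user can be asked to pick another one.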

Upvotes: 1

richardtallent

Reputation: 35374

Just because a letter looks similar to a traditional English letter does not make it equivalent. What is the business case for not just supporting Unicode and any characters your audience chooses to use?

Upvotes: 0

luvieere

Reputation: 37514

Check out this question and its answers and take a look at this blog entry on converting diacritical characters to their ASCII equivalents.
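
For reference, one common way to do this kind of conversion in C# (not necessarily the exact code from the linked posts) is to decompose the string with Unicode normalization and drop the combining marks. The sketch below assumes that approach; the class name AsciiFolding and the small override table are made up for illustration, and the table only covers "ß", which has no decomposition.

```csharp
using System.Collections.Generic;
using System.Globalization;
using System.Text;

// Hypothetical helper: strips diacritics via Unicode decomposition (form D).
public static class AsciiFolding
{
    // Illustrative overrides for characters that do not decompose; extend as needed.
    private static readonly Dictionary<char, string> Overrides =
        new Dictionary<char, string>
        {
            { '\u00DF', "ss" }, // ß
        };

    public static string RemoveDiacritics(string input)
    {
        // Form D splits e.g. "ä" into "a" followed by a combining diaeresis.
        string decomposed = input.Normalize(NormalizationForm.FormD);
        var sb = new StringBuilder(decomposed.Length);

        foreach (char c in decomposed)
        {
            // Skip the combining marks left over from decomposition.
            if (CharUnicodeInfo.GetUnicodeCategory(c) == UnicodeCategory.NonSpacingMark)
                continue;

            string replacement;
            if (Overrides.TryGetValue(c, out replacement))
                sb.Append(replacement);
            else
                sb.Append(c);
        }

        // Recompose whatever is left so untouched text keeps its usual form.
        return sb.ToString().Normalize(NormalizationForm.FormC);
    }
}
```

Calling AsciiFolding.RemoveDiacritics on the question's sample characters turns "ä", "ç", and "Ž" into "a", "c", and "Z". Two caveats: this gives "a" rather than the "ae" used in the question, and letters from other scripts (Cyrillic, for example) are not romanized, so those still need a transliteration table of some kind.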

Upvotes: 2
