Reputation: 289
Is anyone aware of a simple way to anglicize a string? Currently, in our system, we're doing replacements on "invalid" characters, as shown below:
ret = ret.Replace("ä", "ae");
ret = ret.Replace("Ä", "Ae");
ret = ret.Replace("ß", "ss");
ret = ret.Replace("ç", "c");
ret = ret.Replace("Ç", "C");
ret = ret.Replace("Ž", "Z");
The issue here is that as we're opening the business up in additional countries (Turkey, Russia, Hungary...), we're finding that there's a whole slew of characters that this process does not convert.
Is anyone aware of any sort of solution that would allow us to not depend on a table of "invalid" characters?
Also, if it helps, we're using C# to code. :)
Thanks!
edit:
In response to some comments: our system does support the full set of Unicode characters. However, other systems that we integrate with (such as card processors) do not. :(
Upvotes: 2
Views: 505
Reputation: 161
I apologize for a shameless plug, but I couldn't resist. I once wrote a Python module that does exactly what the author of the post needed:
https://github.com/revl/anglicize
Because Python is almost as readable as pseudocode and the module is only about 125 lines long, it's relatively easy to rewrite it in C#.
Here's what the module produces given the input from the original post:
$ echo 'ä Ä ß ç Ç Ž' | anglicize
a A ss s S S
As you can see, "ß" was replaced with "ss" as requested, while "ç", "Ç", and "Ž" were replaced with "s", "S", and "S" respectively, likely because those were the phonetic equivalents in English.
As for "ä" and "Ä", the transliterations "ae" and "Ae" would probably work better than "a" and "A". I will gladly change the transliteration table if the linguists out there confirm that that's the right thing to do.
The module can transliterate the whole input text at once, or it can process input data in chunks. The documentation is in the README file that comes with the module.
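For anyone who wants a starting point in C#, here is a rough sketch of the table-driven idea. To be clear, this is not a port of the module: the class name, method name, and table contents are made up for illustration (the first six entries are the ones from the question, the rest are a handful of guesses for Turkish, Hungarian, and Russian). It does the lookup in a single pass over the input instead of chaining Replace calls.

```csharp
using System.Collections.Generic;
using System.Text;

public static class TableTransliterator
{
    // Hypothetical sample table. A real one would need many more entries,
    // and the right replacement for a given letter depends on the language.
    private static readonly Dictionary<char, string> Table = new Dictionary<char, string>
    {
        ['\u00E4'] = "ae", // ä
        ['\u00C4'] = "Ae", // Ä
        ['\u00DF'] = "ss", // ß
        ['\u00E7'] = "c",  // ç
        ['\u00C7'] = "C",  // Ç
        ['\u017D'] = "Z",  // Ž
        ['\u015F'] = "s",  // ş (Turkish)
        ['\u0131'] = "i",  // ı (Turkish dotless i)
        ['\u0151'] = "o",  // ő (Hungarian)
        ['\u0436'] = "zh", // ж (Russian)
    };

    // Replaces every character found in the table and copies
    // everything else through unchanged.
    public static string Transliterate(string input)
    {
        var sb = new StringBuilder(input.Length);
        foreach (char c in input)
        {
            if (Table.TryGetValue(c, out string replacement))
                sb.Append(replacement);
            else
                sb.Append(c);
        }
        return sb.ToString();
    }
}
```

With this table, TableTransliterator.Transliterate("Straße") gives "Strasse". Characters that are missing from the table pass through untouched, so the output is only ASCII if the table covers everything in the input.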
Upvotes: 1
Reputation: 7772
As an answer to the modified problem (mail server supports only alphanumeric characters in usernames):
Let the users choose their own usernames, allowing only alphanumeric characters. They probably know best how to "anglicize" it.
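If you go this route, the check itself is small. A minimal sketch, assuming "alphanumeric" means ASCII letters and digits only (the class and method names are made up). Note that char.IsLetterOrDigit would not work here, because it also accepts non-ASCII letters such as "ä".

```csharp
using System.Text.RegularExpressions;

public static class UsernameRules
{
    // \z anchors at the very end of the string; $ would also match
    // just before a trailing newline.
    private static readonly Regex AsciiAlphanumeric =
        new Regex(@"^[A-Za-z0-9]+\z", RegexOptions.CultureInvariant);

    // True only for non-empty strings made of A-Z, a-z, and 0-9.
    public static bool IsValid(string candidate)
    {
        return !string.IsNullOrEmpty(candidate)
            && AsciiAlphanumeric.IsMatch(candidate);
    }
}
```

Any length limits or reserved names the mail server imposes would have to be added on top of this.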
Upvotes: 1
Reputation: 35374
Just because a letter looks similar to a traditional English letter does not make it equivalent. What is the business case for not just supporting Unicode and any characters your audience chooses to use?
Upvotes: 0
Reputation: 37514
Check out this question and its answers and take a look at this blog entry on converting diacritical characters to their ASCII equivalents.
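The usual way to do this in .NET is to decompose the string with Unicode normalization and then drop the combining marks. Here is a minimal sketch of that technique (the class and method names are my own, and I can't promise it matches what the linked posts do exactly):

```csharp
using System.Globalization;
using System.Text;

public static class DiacriticStripper
{
    public static string RemoveDiacritics(string input)
    {
        // Form D splits an accented letter into the base letter
        // followed by one or more combining marks.
        string decomposed = input.Normalize(NormalizationForm.FormD);

        var sb = new StringBuilder(decomposed.Length);
        foreach (char c in decomposed)
        {
            // Keep everything except the combining (non-spacing) marks.
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
                sb.Append(c);
        }

        // Recompose whatever is left.
        return sb.ToString().Normalize(NormalizationForm.FormC);
    }
}
```

DiacriticStripper.RemoveDiacritics("ä Ä ç Ç Ž") returns "a A c C Z". Be aware of the limits: "ä" becomes "a" rather than "ae", and characters that have no decomposition, such as "ß", the Turkish dotless "ı", or any Cyrillic letter, come out unchanged. For those you still need an explicit mapping on top of this.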
Upvotes: 2