user2349053

Reputation: 41

Convert language characters to Latin alphabet

I'm trying to write an application for learning foreign characters. For example, if you want to learn Japanese, you have to memorize all the Hiragana, Katakana and Kanji characters (e.g. あ、い、か... = Hiragana; カ、サ、ケ... = Katakana; 本、学... = Kanji).

Example: a user who is learning Japanese has to learn:
か = ka
本 = hon, meaning: basis / book / this

And he also has to learn the pronunciation.

My first question is whether there is a library or something similar that makes this easy in .NET. I also looked at Microsoft IME, but I couldn't work out how to use it in my project.

I also looked at the Unicode database, and it is basically possible to do it with that. I managed to write a project that converts か to ka. The only thing missing is the meanings (for example 本 = basis / book / this), which the Unicode database also provides. Unfortunately I couldn't find them in the .XML file I get the UCD data from, although they do show up when I enter the character on the UCD website.
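For the kana part, one way this can work is to read the reading straight out of the Unicode character name (か is named HIRAGANA LETTER KA). This is only a rough sketch in Python, not the asker's actual project: `kana_to_latin` is a made-up helper name, and the same lookup should be possible from the `na` attribute in the UCD XML in .NET. Note that the names follow a Kunrei-style spelling, so し comes out as "si", not the Hepburn "shi".

```python
import unicodedata

# Prefixes used by the Unicode character names of the kana blocks.
KANA_PREFIXES = ("HIRAGANA LETTER ", "KATAKANA LETTER ")


def kana_to_latin(ch: str) -> str:
    """Derive a Latin reading for one kana from its Unicode character name.

    Hypothetical helper for illustration. Small kana keep the word
    "small" (e.g. "small tu"), which a real tool would handle separately.
    """
    name = unicodedata.name(ch)  # raises ValueError for unnamed code points
    for prefix in KANA_PREFIXES:
        if name.startswith(prefix):
            return name[len(prefix):].lower()
    raise ValueError(f"{ch!r} is not a kana letter: {name}")


if __name__ == "__main__":
    for ch in "かカあ":
        print(ch, "=", kana_to_latin(ch))
```

This does not help for Kanji, whose names are just "CJK UNIFIED IDEOGRAPH-xxxx"; their readings and meanings live in separate Unihan properties.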

Another approach would be to use CLDR, which also seems to be related to UCD. Unfortunately I couldn't figure out which of the two (UCD or CLDR) I should use. CLDR: http://cldr.unicode.org/

My question is whether UCD is the best way to do this, and whether I could also use CLDR.

I don't really want to work with hand-made lists where I type in all the characters myself. It would take too much time, especially for Kanji (more than 10,000 characters).

Thanks

EDIT: I solved it by extracting the information from the Unicode Character Database (UCD). You can download the whole database as a .XML file. I just needed to learn how to handle it and find the correct attributes.
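For anyone landing here later, a minimal sketch of what such an extraction could look like, in Python for brevity. The edit does not say which attributes were used, so this assumes the Unihan properties `kJapaneseOn`, `kJapaneseKun` and `kDefinition` on the `<char>` elements. The inline `SAMPLE` is a tiny hand-written fragment in the shape of the UCD XML, with illustrative values, so the snippet runs without the download, and `lookup` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Namespace used by the UCD XML files.
UCD_NS = "{http://www.unicode.org/ns/2003/ucd/1.0}"

# Hand-written stand-in for the real (very large) file; values are illustrative.
SAMPLE = """<ucd xmlns="http://www.unicode.org/ns/2003/ucd/1.0">
  <repertoire>
    <char cp="672C" kJapaneseOn="HON" kJapaneseKun="MOTO"
          kDefinition="root, origin, source; basis"/>
  </repertoire>
</ucd>"""


def lookup(xml_text: str, character: str):
    """Return readings and meaning for one character, or None if absent."""
    wanted = format(ord(character), "04X")  # code point as in the cp attribute
    root = ET.fromstring(xml_text)
    for char in root.iter(UCD_NS + "char"):
        if char.get("cp") == wanted:
            return {
                "on": char.get("kJapaneseOn"),      # assumed attribute name
                "kun": char.get("kJapaneseKun"),    # assumed attribute name
                "meaning": char.get("kDefinition"), # assumed attribute name
            }
    return None


if __name__ == "__main__":
    print(lookup(SAMPLE, "本"))
```

For the real download it is probably better to stream the file (`ET.iterparse` in Python, `XmlReader` in .NET) than to load it all at once, since the full database is large.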

Upvotes: 3

Views: 3340

Answers (2)

phil soady

Reputation: 11328

Both Google and Microsoft offer APIs you can call to translate text. eg http://www.microsoft.com/en-us/translator/translatorapi.aspx

Depending on the type of service you choose, a small fee might be required. They also offer audio for the translation. No need to reinvent this wheel. :-)

If this is a code page type of question, this article is an amusing place to start: http://www.joelonsoftware.com/articles/Unicode.html

EDIT, in response to the comment about options: Google can supply several possible translations, e.g. for 本.

[image: Google's suggested translations for 本]

Upvotes: 2

Stefan Steiger

Reputation: 82176

What you are looking for is a transliteration API or library.
Well, actually, what you want is a romanization library, which is not quite the same, but you'd better forget I said that; you'll find out soon enough, and I don't want to shatter your daydreams.

You might want to look at this https://bitbucket.org/Dimps/unidecodesharpfork
or this http://unidecode.codeplex.com/
or this http://transliterator.codeplex.com/

I used unidecodesharpfork to transliterate Russian, and the result is somewhat unsatisfactory: it only transliterates character by character and doesn't properly romanize according to the ISO standard.

Unfortunately, "transliteration" (what you actually need is romanization, so wherever you or I say transliteration, read romanization) isn't as simple as taking a list of characters in one alphabet and substituting each one with the corresponding character in another alphabet, which seems to be the basic belief of the unidecodesharpfork author.

There are rules, because sometimes transliteration depends on the preceding or following character, and there is also an ISO Standard on Romanization, e.g. for Russian (see http://en.wikipedia.org/wiki/Romanization_of_Russian).
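To make the "depends on the neighbouring character" point concrete with a Japanese case: the small っ has no sound of its own in romaji and instead doubles the consonant of the kana that follows it, so a plain one-to-one table cannot handle it. A rough sketch in Python, where `BASE` is a tiny hand-picked subset of a Hepburn-style table and `romanize` is a made-up name; a real romanizer needs many more rules (long vowels, ん, combinations like きゃ).

```python
# Tiny illustrative subset of a kana table (assumption: Hepburn-style spellings).
BASE = {"が": "ga", "こ": "ko", "う": "u", "き": "ki", "て": "te"}
SOKUON = "っ"  # small tsu: doubles the next consonant


def romanize(text: str) -> str:
    """Romanize kana from BASE, applying the context rule for the small tsu."""
    out = []
    double_next = False
    for ch in text:
        if ch == SOKUON:
            # Emit nothing now; the rule applies to the following kana.
            double_next = True
            continue
        latin = BASE[ch]  # KeyError for kana outside the sample table
        if double_next:
            latin = latin[0] + latin  # repeat the leading consonant
            double_next = False
        out.append(latin)
    return "".join(out)


if __name__ == "__main__":
    print(romanize("がっこう"))
    print(romanize("きって"))
```

A character-by-character table would have to map っ to some fixed string, which is exactly the kind of output that looks wrong to anyone who knows the script.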

Also, transliteration isn't culture-independent. For example, a German speaker transliterates Russian differently than an English speaker does.
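A small illustration of that, as a sketch in Python: the same Cyrillic letter gets a different Latin spelling depending on the target language (ш is usually "sch" in German texts and "sh" in English ones). The tables below cover only the handful of letters needed for the example and follow common usage, not a specific standard; `transcribe` is a hypothetical name.

```python
# Letters that come out the same for both target languages (tiny subset).
COMMON = {"п": "p", "у": "u", "к": "k", "и": "i", "н": "n"}

# Letters whose usual spelling depends on the reader's language.
BY_LANGUAGE = {
    "en": {"ш": "sh", "ч": "ch", "ж": "zh"},
    "de": {"ш": "sch", "ч": "tsch", "ж": "sch"},
}


def transcribe(text: str, language: str) -> str:
    """Transcribe Russian text for readers of the given language ("en" or "de")."""
    table = {**COMMON, **BY_LANGUAGE[language]}
    out = []
    for ch in text:
        latin = table[ch.lower()]  # KeyError for letters outside the sample tables
        out.append(latin.capitalize() if ch.isupper() else latin)
    return "".join(out)


if __name__ == "__main__":
    print(transcribe("Пушкин", "en"))
    print(transcribe("Пушкин", "de"))
```

So any library you pick has to be told, or has to assume, who the reader is.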

Therefore, for serious use, I would use the Google Transliterate API (which only provides the English speaker's standpoint), but I just saw that it has been deprecated: https://developers.google.com/transliterate/

Maybe it's high time to read out the transliterations for those 10,000 characters :)

Upvotes: 1
