Reputation: 19
I'm trying to decompose Korean words into their individual letters (jamo) in C# using the "FormD" normalization:
var text = "루돌프사슴코";
String a = text.Normalize(NormalizationForm.FormD);
foreach (var c in a)
{
Console.Write(c + " ");
}
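This prints conjoining Hangul jamo that still encode their position within the syllable. The three lines below show the input, the output I want, and the output I actually get: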
루돌프사슴코
ㄹ ㅜ ㄷ ㅗ ㄹ ㅍ ㅡ ㅅ ㅏ ㅅ ㅡ ㅁ ㅋ ㅗ
ᄅ ᅮ ᄃ ᅩ ᆯ ᄑ ᅳ ᄉ ᅡ ᄉ ᅳ ᆷ ᄏ ᅩ
How do I convert the output so that it uses the "regular" jamo from the Hangul Compatibility Jamo Unicode block?
Upvotes: 1
Views: 135
Reputation: 120
First, you have to understand how Unicode works. Here is a nice video about it.
After that, here is a Wikipedia article about this exact problem, and an article with the reverse equations for recovering the individual jamo from a syllable's code point.
My solution:
using System;
using System.Text;
var text = "루돌프사슴코";
Console.OutputEncoding = System.Text.Encoding.UTF8;
Console.WriteLine(Encoding.Unicode.GetByteCount(text));
foreach (var word in text)
{
Console.WriteLine($"{word} => {(int)word} => (initial) × 588 + (medial) × 28 + (final) = {word - 44032}");
}
Console.WriteLine("END");
Console.WriteLine(text.Normalize(NormalizationForm.FormD));
String a = text.Normalize(NormalizationForm.FormD);
foreach (var c in a)
{
Console.Write(c);
}
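// Work on a single syllable; 44032 (U+AC00) is the first precomposed Hangul syllable.
var firstWord = '루';
// Syllable index = lead × 588 + vowel × 28 + tail, so invert it with division and remainder.
// lead and vowel are made 1-based here; tail stays 0-based (0 = no final consonant).
var tail = ((int)firstWord - 44032) % 28;
var vowel = 1 + (((int)firstWord - 44032 - tail) % 588) / 28;
var lead = 1 + ((int)firstWord - 44032) / 588;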
Console.WriteLine();
var leadInUnicode = LeadUnicode(lead);
Console.WriteLine(leadInUnicode);
var vowelInUnicode = VowelUnicode(vowel);
Console.WriteLine(vowelInUnicode);
var tailInUnicode = TailUnicode(tail);
Console.WriteLine(tailInUnicode);
char LeadUnicode(int lead)
{
return (char)(lead + 4351);
}
char VowelUnicode(int vowel)
{
return (char)(vowel + 4448);
}
string TailUnicode(int tail)
{
// tail == 0 means the syllable has no final consonant.
// Finals start at U+11A8 (4520) for tail == 1, hence the 4519 offset.
return tail == 0 ? "" : ((char)(tail + 4519)).ToString();
}
With output:
12
루 => 47336 => (initial) × 588 + (medial) × 28 + (final) = 3304
돌 => 46028 => (initial) × 588 + (medial) × 28 + (final) = 1996
프 => 54532 => (initial) × 588 + (medial) × 28 + (final) = 10500
사 => 49324 => (initial) × 588 + (medial) × 28 + (final) = 5292
슴 => 49844 => (initial) × 588 + (medial) × 28 + (final) = 5812
코 => 53076 => (initial) × 588 + (medial) × 28 + (final) = 9044
END
루돌프사슴코
루돌프사슴코
ᄅ
ᅮ

The last line is empty because 루 has no final consonant.
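The code above yields conjoining jamo (the U+1100 block), while the question asks for Hangul Compatibility Jamo. Below is a minimal sketch of one way to get those, assuming only precomposed syllables (U+AC00 to U+D7A3) need decomposing. It uses the same index arithmetic but looks the indices up in three tables of compatibility jamo. The helper name `ToCompatibilityJamo` and the table names are illustrative, not part of any library.

```csharp
using System;
using System.Text;

// Compatibility jamo listed in the index order used by the syllable formula.
// These tables are an assumption of this sketch, typed out by hand.
const string Leads = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ";            // 19 initials
const string Vowels = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ";        // 21 medials
const string Tails = "ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ"; // 27 finals

// Hypothetical helper: splits every precomposed syllable into compatibility jamo
// and copies any other character through unchanged.
string ToCompatibilityJamo(string text)
{
    var sb = new StringBuilder();
    foreach (char c in text)
    {
        int s = c - 0xAC00;                 // index within the syllable block
        if (s < 0 || s > 11171)             // not a precomposed Hangul syllable
        {
            sb.Append(c);
            continue;
        }
        sb.Append(Leads[s / 588]);          // initial consonant
        sb.Append(Vowels[s % 588 / 28]);    // medial vowel
        int tail = s % 28;                  // 0 means no final consonant
        if (tail != 0) sb.Append(Tails[tail - 1]);
    }
    return sb.ToString();
}

Console.OutputEncoding = Encoding.UTF8;
foreach (char jamo in ToCompatibilityJamo("루돌프사슴코"))
{
    Console.Write(jamo + " ");
}
// prints: ㄹ ㅜ ㄷ ㅗ ㄹ ㅍ ㅡ ㅅ ㅏ ㅅ ㅡ ㅁ ㅋ ㅗ
```

Using lookup tables avoids relying on the compatibility block's code point order, which interleaves initial-only and final-only consonants and so does not line up with the conjoining jamo by a fixed offset.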
Upvotes: -1