user3434046

Reputation: 19

How to decompose Hangul words into their alphabet components?

I'm trying to decompose a list of Korean words into their alphabet components (jamo) in C# using "FormD" normalization:

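using System;
using System.Text;

var text = "루돌프사슴코";
// FormD splits each precomposed syllable into conjoining jamo (U+1100 block).
string a = text.Normalize(NormalizationForm.FormD);
foreach (var c in a)
{
    Console.Write(c + " ");
}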

This results in Hangul jamo that still encode their compositional position:

How do I convert the output so that it uses the "regular" jamo from the Hangul Compatibility Jamo Unicode block?

Upvotes: 1

Views: 135

Answers (1)

Radek Sedlář

Reputation: 120

First, you have to understand how Unicode works. Here is a nice video about it.

After that, there is a Wikipedia article about this exact problem, and an article with the reverse equations for recovering the individual jamo from a syllable's code point.
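
As a rough illustration of those reverse equations, here is the arithmetic worked through for a single syllable. This is only a sketch: the variable names are my own, and all three indices are 0-based here, with a tail of 0 meaning "no final consonant".

```csharp
using System;

// Worked example for 돌 (U+B3CC = 46028); all indices are 0-based.
char syllable = '돌';
int offset = syllable - 44032;     // 1996 (44032 = U+AC00, the first syllable 가)
int lead  = offset / 588;          // 3 -> fourth initial (ㄷ)
int vowel = (offset % 588) / 28;   // 8 -> ninth medial (ㅗ)
int tail  = offset % 28;           // 8 -> eighth final (ㄹ); 0 would mean "no final"

Console.WriteLine($"{offset} = {lead} × 588 + {vowel} × 28 + {tail}");

// Recomposing from the indices must give back the original syllable.
char recomposed = (char)(44032 + lead * 588 + vowel * 28 + tail);
Console.WriteLine(recomposed == syllable);  // True
```

The constants come from the block layout: 28 possible finals (including "none") per vowel, and 21 × 28 = 588 syllables per initial.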

My solution:

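using System;
using System.Text;

var text = "루돌프사슴코";
Console.OutputEncoding = Encoding.UTF8;
// UTF-16 uses 2 bytes per precomposed syllable, so 6 syllables give 12 bytes.
Console.WriteLine(Encoding.Unicode.GetByteCount(text));
foreach (var syllable in text)
{
    // 44032 (U+AC00) is the code point of the first precomposed syllable, 가.
    Console.WriteLine($"{syllable} => {(int)syllable} => (initial) × 588 + (medial) × 28 + (final) = {syllable - 44032}");
}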
Console.WriteLine("END");
Console.WriteLine(text.Normalize(NormalizationForm.FormD));

String a = text.Normalize(NormalizationForm.FormD);
foreach (var c in a)
{
    Console.Write(c);
}

var firstSyllable = '루';
// tail is 0-based, where 0 means "no final consonant"; vowel and lead are 1-based.
var tail = (firstSyllable - 44032) % 28;
var vowel = 1 + ((firstSyllable - 44032 - tail) % 588) / 28;
var lead = 1 + (firstSyllable - 44032) / 588;
Console.WriteLine();
var leadInUnicode = LeadUnicode(lead);
Console.WriteLine(leadInUnicode);
var vowelInUnicode = VowelUnicode(vowel);
Console.WriteLine(vowelInUnicode);
var tailInUnicode = TailUnicode(tail);
Console.WriteLine(tailInUnicode);


char LeadUnicode(int lead)
{
    return (char)(lead + 4351);
}

char VowelUnicode(int vowel)
{
    return (char)(vowel + 4448);
}

string TailUnicode(int tail)
{
    // tail == 0 means there is no final consonant; finals start at U+11A8 (4520) for tail == 1.
    return tail == 0 ? "" : ((char)(tail + 4519)).ToString();
}

With output (the last line is empty because 루 has no final consonant):

12
루 => 47336 => (initial) × 588 + (medial) × 28 + (final) = 3304
돌 => 46028 => (initial) × 588 + (medial) × 28 + (final) = 1996
프 => 54532 => (initial) × 588 + (medial) × 28 + (final) = 10500
사 => 49324 => (initial) × 588 + (medial) × 28 + (final) = 5292
슴 => 49844 => (initial) × 588 + (medial) × 28 + (final) = 5812
코 => 53076 => (initial) × 588 + (medial) × 28 + (final) = 9044
END
루돌프사슴코
루돌프사슴코
ᄅ
ᅮ
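
The code above still yields conjoining jamo from the U+1100 block, not the Hangul Compatibility Jamo (U+3130 block) that the question asks for. One possible way to get those is sketched below. It assumes the input consists of precomposed syllables (U+AC00 to U+D7A3) and passes everything else through unchanged. The helper name ToCompatibilityJamo and the two lookup strings are my own, not part of any library. The vowels need no table because the compatibility block lists them in the same order as the medials, starting at U+314F.

```csharp
using System;
using System.Text;

// Compatibility jamo in the order used by the syllable formula.
const string Leads = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ";                 // 19 initials, index 0..18
const string Tails = "ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ"; // 27 finals, for tail index 1..27

Console.OutputEncoding = Encoding.UTF8;
Console.WriteLine(ToCompatibilityJamo("루돌프사슴코"));  // ㄹㅜㄷㅗㄹㅍㅡㅅㅏㅅㅡㅁㅋㅗ

// Hypothetical helper: maps each precomposed syllable to compatibility jamo.
string ToCompatibilityJamo(string text)
{
    var sb = new StringBuilder();
    foreach (char c in text)
    {
        // Leave anything outside the precomposed syllable block untouched.
        if (c < '\uAC00' || c > '\uD7A3') { sb.Append(c); continue; }

        int offset = c - 0xAC00;           // 0xAC00 == 44032
        int lead = offset / 588;           // 0-based initial index
        int vowel = (offset % 588) / 28;   // 0-based medial index
        int tail = offset % 28;            // 0 means "no final consonant"

        sb.Append(Leads[lead]);
        sb.Append((char)(0x314F + vowel)); // U+314F is ㅏ, the first compatibility vowel
        if (tail > 0) sb.Append(Tails[tail - 1]);
    }
    return sb.ToString();
}
```

Working from the syllable's code point rather than from the FormD output avoids a second mapping step from conjoining jamo to compatibility jamo. The trade-off is that the position information is lost: an initial ㄹ and a final ㄹ both come out as U+3139.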

Upvotes: -1
