Categorizing this Thai character using the .NET framework

Question

I'm trying to parse some Thai text according to the rules explained here http://www.thai-language.com/ref/spacing

Basically, I want to find strings of characters between whitespace and punctuation similar to how we would do in English. I realise that words themselves are not necessarily split by spaces in Thai, that's OK.

To parse the text I tried simply looping, like

while( i < theText.Length && Char.IsLetterOrDigit(theText[i]) ) { i++; }

to find the next character that isn't a letter or digit. That works except for certain characters, such as the one with character code 3657 (U+0E49), which appears as the second character of a word (I think it's the mark written above the first character of the word).

This character doesn't seem to be categorized as anything by the Char class, i.e. the following calls:

Char.IsLowSurrogate((char)3657)
Char.IsPunctuation((char)3657)
Char.IsWhiteSpace((char)3657)
Char.IsSymbol((char)3657)
Char.IsSeparator((char)3657)
Char.IsDigit((char)3657)
Char.IsControl((char)3657)
Char.IsLetter((char)3657)
Char.IsSurrogate((char)3657)

all return false.

This character might be a tone mark. How would that be identified using .NET?

Sami Kuhmonen · Accepted Answer

According to the Unicode specification, the character is U+0E49 THAI CHARACTER MAI THO (mai tho), and its general category is “Mark, nonspacing” (Mn).

You can use the Char.GetUnicodeCategory() method to check the category. For nonspacing marks it returns UnicodeCategory.NonSpacingMark, whose underlying value is 5.
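As a rough sketch of how this could fit into the loop from the question, the scan can accept combining marks in addition to letters and digits. The class name ThaiTextScanner and the helpers IsWordChar and SplitRuns are hypothetical names made up for this illustration, and including the Mc and Me categories alongside Mn is an assumption about what you want to keep inside a run.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical helper class, not part of the .NET framework.
public static class ThaiTextScanner
{
    // True for letters, digits, and combining marks (assumed to belong to a run).
    public static bool IsWordChar(char c)
    {
        if (Char.IsLetterOrDigit(c))
            return true;

        switch (Char.GetUnicodeCategory(c))
        {
            case UnicodeCategory.NonSpacingMark:       // Mn, e.g. U+0E49 mai tho
            case UnicodeCategory.SpacingCombiningMark: // Mc
            case UnicodeCategory.EnclosingMark:        // Me
                return true;
            default:
                return false;
        }
    }

    // Splits text into runs of word characters; everything else is a separator.
    public static List<string> SplitRuns(string text)
    {
        var runs = new List<string>();
        int i = 0;
        while (i < text.Length)
        {
            // Skip whitespace, punctuation, and other non-word characters.
            while (i < text.Length && !IsWordChar(text[i])) i++;

            int start = i;
            // Consume one run of letters, digits, and marks.
            while (i < text.Length && IsWordChar(text[i])) i++;

            if (i > start)
                runs.Add(text.Substring(start, i - start));
        }
        return runs;
    }
}
```

With this, ThaiTextScanner.IsWordChar((char)3657) returns true, so a tone mark no longer ends the run early. Checking the category explicitly, rather than negating IsWhiteSpace and IsPunctuation, keeps symbols and control characters out of the runs.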
