Jim W

Reputation: 5016

Categorizing this Thai character using the .NET framework

I'm trying to parse some Thai text according to the rules explained here: http://www.thai-language.com/ref/spacing

Basically, I want to find strings of characters between whitespace and punctuation similar to how we would do in English. I realise that words themselves are not necessarily split by spaces in Thai, that's OK.

To parse the text I tried simply looping, like

while (i < theText.Length && Char.IsLetterOrDigit(theText[i])) { i++; }

to find the next character that isn't a letter or digit. That works except for certain characters like this one

Thai character

which is the second character in this word (I think that's the character 'superscripting' the first character in the word).

Thai word

This character doesn't seem to be categorized as anything by the Char class, i.e.

Char.IsLowSurrogate((char)3657)
Char.IsPunctuation((char)3657)
Char.IsWhiteSpace((char)3657)
Char.IsSymbol((char)3657)
Char.IsSeparator((char)3657)
Char.IsDigit((char)3657)
Char.IsControl((char)3657)
Char.IsLetter((char)3657)
Char.IsSurrogate((char)3657)

all return false.

This character might be a 'tone' - how would that be identified using .NET?

Upvotes: 0

Views: 786

Answers (1)

Sami Kuhmonen

Reputation: 31153

According to the Unicode specification, the character is U+0E49 THAI CHARACTER MAI THO (a tone mark), and its general category is “Mark, nonspacing” (Mn).

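You can use the Char.GetUnicodeCategory() method to check the category. For non-spacing marks it returns UnicodeCategory.NonSpacingMark, whose numeric value is 5. None of the Char.IsXxx() methods listed in the question test for that category, which is why they all return false.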
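As a rough sketch of how that could fit into the loop from the question: treat combining marks as part of the run alongside letters and digits. The names ThaiRuns, IsRunChar and SplitRuns are made up for illustration, and including SpacingCombiningMark and EnclosingMark as well as NonSpacingMark is my assumption about what you would want. It does not handle surrogate pairs.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class ThaiRuns
{
    // Hypothetical helper: a character belongs to a run if it is a letter,
    // a digit, or a combining mark (e.g. the Thai tone mark U+0E49).
    public static bool IsRunChar(char c)
    {
        if (Char.IsLetterOrDigit(c)) return true;

        switch (Char.GetUnicodeCategory(c))
        {
            case UnicodeCategory.NonSpacingMark:       // Mn
            case UnicodeCategory.SpacingCombiningMark: // Mc (assumed wanted)
            case UnicodeCategory.EnclosingMark:        // Me (assumed wanted)
                return true;
            default:
                return false;
        }
    }

    // Hypothetical helper: returns the maximal runs of run characters,
    // skipping whitespace, punctuation and anything else in between.
    public static List<string> SplitRuns(string text)
    {
        var runs = new List<string>();
        int i = 0;
        while (i < text.Length)
        {
            // Skip separators.
            while (i < text.Length && !IsRunChar(text[i])) i++;

            // Consume one run.
            int start = i;
            while (i < text.Length && IsRunChar(text[i])) i++;

            if (i > start) runs.Add(text.Substring(start, i - start));
        }
        return runs;
    }
}
```

With this, ThaiRuns.IsRunChar((char)3657) is true, so a tone mark no longer ends the run early.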

Upvotes: 2
