Reputation: 5016
I'm trying to parse some Thai text according to the rules explained here http://www.thai-language.com/ref/spacing
Basically, I want to find strings of characters between whitespace and punctuation similar to how we would do in English. I realise that words themselves are not necessarily split by spaces in Thai, that's OK.
To parse the text I tried simply looping, like
while( Char.IsLetterOrDigit(theText[i++]) ) {}
to find the next character that isn't a letter or digit. That works except for certain characters such as (char)3657, which is the second character in the word I'm looking at (I think it's the mark written above the first character in the word).
This character doesn't seem to be categorized as anything by the Char class, i.e.:
Char.IsLowSurrogate((char)3657)
Char.IsPunctuation((char)3657)
Char.IsWhiteSpace((char)3657)
Char.IsSymbol((char)3657)
Char.IsSeparator((char)3657)
Char.IsDigit((char)3657)
Char.IsControl((char)3657)
Char.IsLetter((char)3657)
Char.IsSurrogate((char)3657)
all return false.
This character might be a tone mark. How would that be identified using .NET?
Upvotes: 0
Views: 786
Reputation: 31153
According to the Unicode specification, the character is U+0E49 THAI CHARACTER MAI THO, and it is in the general category “mark, nonspacing (Mn).”
You can use the Char.GetUnicodeCategory() method to check the type. For non-spacing marks the type is UnicodeCategory.NonSpacingMark, which has the numeric value 5.
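As a rough sketch of how that check could be folded into the loop from the question: the class name ThaiTokenSketch and the helpers IsWordChar and FindRunEnd are made up for illustration, and treating all three mark categories (Mn, Mc, Me) as part of a word is an assumption, not something the Thai spacing rules require.

```csharp
using System;
using System.Globalization;

static class ThaiTokenSketch
{
    // Hypothetical helper: true for letters, digits, and combining marks.
    // Assumption: tone marks and vowel signs should stay attached to the
    // run of letters they modify.
    public static bool IsWordChar(char c)
    {
        if (Char.IsLetterOrDigit(c)) return true;

        switch (Char.GetUnicodeCategory(c))
        {
            case UnicodeCategory.NonSpacingMark:       // Mn, e.g. mai tho (char)3657
            case UnicodeCategory.SpacingCombiningMark: // Mc
            case UnicodeCategory.EnclosingMark:        // Me
                return true;
            default:
                return false;
        }
    }

    // Hypothetical helper: returns the index of the first character at or
    // after 'start' that is not a word character, or text.Length if none.
    public static int FindRunEnd(string text, int start)
    {
        int i = start;
        while (i < text.Length && IsWordChar(text[i]))
        {
            i++;
        }
        return i;
    }
}
```

With this, a call such as ThaiTokenSketch.FindRunEnd(theText, i) would skip over mai tho instead of stopping at it. Unlike the original loop, it also checks the string length, so it does not run past the end of the text.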
Upvotes: 2