Reputation: 12554
I am developing a heuristic for automatic language detection and would like to find out whether a given letter has diacritics (as in "Ðàäèî Êóëüòóðà" -- all letters have diacritics). Ideally, I would also like to get the type of diacritic.
I browsed through the UnicodeCategory enum but didn't find anything that could help me here.
Upvotes: 7
Views: 8173
Reputation: 323
Try this:
// Requires: using System.Globalization; using System.Text;
public bool CheckIsStringContainDiacriticsCharacter(string text)
{
    // Decompose each accented letter into its base letter plus combining marks.
    var normalizedString = text.Normalize(NormalizationForm.FormD);
    foreach (var c in normalizedString)
    {
        // A non-spacing mark after decomposition indicates a diacritic.
        if (CharUnicodeInfo.GetUnicodeCategory(c) == UnicodeCategory.NonSpacingMark)
        {
            return true;
        }
    }
    return false;
}
Upvotes: 1
Reputation: 108880
One possible way is to normalize it to a form where letters and their diacritics are written as several codepoints. Then check if you have a letter followed by accents.
Adapting from How do I remove diacritics (accents) from a string in .NET?, you can normalize with Normalize(NormalizationForm.FormD) and check for the diacritics with UnicodeCategory.NonSpacingMark.
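// Requires: using System.Globalization; using System.Linq; using System.Text;
bool IsLetterWithDiacritics(char c)
{
    // FormD splits a precomposed letter into its base letter followed by combining marks.
    var s = c.ToString().Normalize(NormalizationForm.FormD);
    return (s.Length > 1) &&
           char.IsLetter(s[0]) &&
           s.Skip(1).All(c2 => CharUnicodeInfo.GetUnicodeCategory(c2) == UnicodeCategory.NonSpacingMark);
}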
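To get the type of diacritic as well, the same decomposition can be reused: after normalizing to FormD, the non-spacing marks that follow the base letter are the diacritics themselves. The following is only a rough sketch under some assumptions. The class name DiacriticInspector, the method GetDiacritics and the KnownMarks table are made up for illustration, and the table names just a few common combining marks. Anything else is reported by its code point.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;
using System.Text;

public static class DiacriticInspector
{
    // Hypothetical lookup table (not part of .NET): names for a few common combining marks.
    private static readonly Dictionary<char, string> KnownMarks = new Dictionary<char, string>
    {
        { '\u0300', "grave" },
        { '\u0301', "acute" },
        { '\u0302', "circumflex" },
        { '\u0303', "tilde" },
        { '\u0308', "diaeresis" },
        { '\u030A', "ring above" },
        { '\u030C', "caron" },
        { '\u0327', "cedilla" },
    };

    // Returns one entry per non-spacing mark found after the base character
    // once c is decomposed with FormD. An empty list means no such marks were found.
    public static List<string> GetDiacritics(char c)
    {
        // A lone surrogate is not a valid string on its own and cannot be normalized.
        if (char.IsSurrogate(c))
        {
            return new List<string>();
        }

        var decomposed = c.ToString().Normalize(NormalizationForm.FormD);
        return decomposed
            .Skip(1) // skip the base character
            .Where(m => CharUnicodeInfo.GetUnicodeCategory(m) == UnicodeCategory.NonSpacingMark)
            .Select(m => KnownMarks.TryGetValue(m, out var name)
                ? name
                : "U+" + ((int)m).ToString("X4")) // fall back to the code point
            .ToList();
    }
}
```

With this sketch, GetDiacritics('é') gives a list containing "acute" and GetDiacritics('ü') gives "diaeresis". Note that some letters that look accented, such as 'Ð' (U+00D0) or 'ø', have no canonical decomposition, so both this method and IsLetterWithDiacritics treat them as plain letters.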
Upvotes: 16