Reputation: 114046
I don't know the exact technical terminology, but UTF-8 as a standard includes characters from certain language groupings, which can be observed in the Windows Character Map with a font like Arial Unicode MS.
How do I obtain a list of the characters under each set? This could be an API or just a plain list/DB somewhere on the net. I found the wiki article that lists everything, but not in an iterable form. Any ideas?
Upvotes: 6
Views: 2852
Reputation: 114046
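You can get the entire list of Unicode characters from the published UnicodeData.txt, a semicolon-delimited file that lists each character along with its category information (a few very large ranges, such as the CJK ideographs, appear only as a first/last pair of entries).
The third field specifies the character's general category as a two-letter abbreviation; the long forms are specified here.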
letter-character -- classes Lu, Ll, Lt, Lm, Lo, or Nl
combining-character -- classes Mn or Mc
decimal-digit-character -- class Nd
connecting-character -- class Pc
formatting-character -- class Cf
It's even possible to iterate through the chars of a certain group using C# LINQ:
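using System;
using System.Globalization;
using System.Linq;

// Walk every code point, skipping the surrogate range, which has no string form of its own.
var charInfo = Enumerable.Range(0, 0x110000)
    .Where(x => x < 0x00d800 || x > 0x00dfff)
    .Select(char.ConvertFromUtf32)
    // Bucket each character by the category the runtime reports for it.
    .GroupBy(s => char.GetUnicodeCategory(s, 0))
    .ToDictionary(g => g.Key);

// Print every lowercase letter (category Ll).
foreach (var ch in charInfo[UnicodeCategory.LowercaseLetter])
{
    Console.Write(ch);
}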
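If you would rather work from UnicodeData.txt itself than from the runtime's built-in tables, the same grouping can be done on the file's third field. The following is only a rough sketch: the class and method names (UnicodeDataDemo, GroupByCategory) are made up for illustration, and the embedded sample lines stand in for the real file.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class UnicodeDataDemo
{
    // A few lines in the UnicodeData.txt layout: code point;name;general category;...
    public static readonly string[] SampleLines =
    {
        "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;",
        "0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041",
        "0030;DIGIT ZERO;Nd;0;EN;;0;0;0;N;;;;;",
        "03B1;GREEK SMALL LETTER ALPHA;Ll;0;L;;;;;N;;;0391;;0391",
    };

    // Groups code points by the third field (the two-letter general category).
    public static Dictionary<string, List<int>> GroupByCategory(IEnumerable<string> lines)
    {
        return lines
            .Where(line => !string.IsNullOrWhiteSpace(line))
            .Select(line => line.Split(';'))
            .GroupBy(f => f[2], f => Convert.ToInt32(f[0], 16))
            .ToDictionary(g => g.Key, g => g.ToList());
    }

    public static void Main()
    {
        // For the real file, pass System.IO.File.ReadLines("UnicodeData.txt") instead.
        foreach (var pair in GroupByCategory(SampleLines))
        {
            // Safe here because the sample contains no surrogate code points.
            var chars = string.Concat(pair.Value.Select(cp => char.ConvertFromUtf32(cp)));
            Console.WriteLine($"{pair.Key}: {chars}");
        }
    }
}
```

Note that with the real file, the range entries (names like "<CJK Ideograph, First>") contribute only their two endpoints, and surrogate code points have to be skipped before calling char.ConvertFromUtf32.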
However, the language grouping is not given explicitly in that file, so you'll have to parse the first word of the name to group each char by language. This works reasonably well because names usually begin with the script's name (Latin letters, for instance, typically have names beginning with "LATIN"), but it is a heuristic and not every name follows the pattern. Examples follow:
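A rough sketch of that first-word grouping in C#, assuming input lines in the UnicodeData.txt layout. The names NamePrefixDemo and GroupByNamePrefix are hypothetical, and the sample lines stand in for the real file:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class NamePrefixDemo
{
    // Sample lines in the UnicodeData.txt layout: code point;name;general category;...
    public static readonly string[] SampleLines =
    {
        "0000;<control>;Cc;0;BN;;;;;N;NULL;;;;",
        "0030;DIGIT ZERO;Nd;0;EN;;0;0;0;N;;;;;",
        "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;",
        "0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041",
        "03B1;GREEK SMALL LETTER ALPHA;Ll;0;L;;;;;N;;;0391;;0391",
        "0416;CYRILLIC CAPITAL LETTER ZHE;Lu;0;L;;;;;N;;;;0436;",
    };

    // Groups characters by the first word of the name field.
    // Names in angle brackets (<control>, range markers, surrogates) are skipped.
    public static Dictionary<string, List<string>> GroupByNamePrefix(IEnumerable<string> lines)
    {
        return lines
            .Select(line => line.Split(';'))
            .Where(f => f.Length > 2 && !f[1].StartsWith("<", StringComparison.Ordinal))
            .GroupBy(
                f => f[1].Split(' ')[0],
                f => char.ConvertFromUtf32(Convert.ToInt32(f[0], 16)))
            .ToDictionary(g => g.Key, g => g.ToList());
    }

    public static void Main()
    {
        // For the real file, pass System.IO.File.ReadLines("UnicodeData.txt") instead.
        foreach (var pair in GroupByNamePrefix(SampleLines))
        {
            Console.WriteLine($"{pair.Key}: {string.Concat(pair.Value)}");
        }
    }
}
```

The "DIGIT" bucket in the sample shows the limit of the approach: the first word of a name is not always a script or language, so the resulting keys need some manual filtering.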
Upvotes: 6