Unicode mapping to languages

Question

This question is probably borderline for Stack Overflow, so I apologize in advance if it seems off-topic. I'm writing a program that involves many languages, and I need a table that maps languages to Unicode code points. Those of you familiar with Unicode will know that characters are divided into 'blocks' such as Latin, Cyrillic, etc. Of course, most languages that use Latin characters do not use all the Latin characters, most languages that use Cyrillic characters do not use all the Cyrillic characters, and so on. I'm interested in a table that maps English only to those characters used in English, Spanish only to those characters used in Spanish, etc. There's no need to cover every language in the world (that would be nearly impossible), but it should cover at least some of the more common languages. (Even then, this would be a fairly extensive table involving many-to-many relationships.) I'm not sure that such a table exists. (If it doesn't, I may turn this into an open-source project, as it would be very useful for me and possibly for others.)

Jukka K. Korpela · Accepted Answer

CLDR, the Unicode Common Locale Data Repository, contains definitions for character collections for a large number of languages. The exemplarCharacters element specifies the characters needed for normal writing of words of the language. Current definitions for this element can be seen on the By-Type Chart: misc.exemplarCharacters page (grouped by writing system), but for automated processing, you may find the XML files more suitable. The exemplarCharacters-other element currently contains similar data for punctuation characters.

That’s probably the best available compilation of such information in general, but it is conceptually very vague (it does not really try to define what it means for a character to be used to write a language), and the information for different languages has been collected through a process that is open but has no overall quality control.
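For the automated processing mentioned above, here is a minimal Python sketch of reading the main exemplar set out of an LDML-style XML document. The embedded sample is a made-up, abridged snippet rather than real CLDR data, the function names are my own, and the parser handles only a small subset of the set notation (single characters, simple ranges like a-c, and {multi-character} sequences). Real CLDR files use a richer syntax with escapes and nested sets, so treat this as a starting point only.

```python
import xml.etree.ElementTree as ET

# Illustrative snippet only; NOT copied from an actual CLDR locale file.
SAMPLE_LDML = """<ldml>
  <characters>
    <exemplarCharacters>[a-c ñ {ch}]</exemplarCharacters>
    <exemplarCharacters type="auxiliary">[á é]</exemplarCharacters>
  </characters>
</ldml>"""


def parse_simple_set(text):
    """Parse a tiny subset of the set notation used in exemplarCharacters.

    Supported: single characters, ranges such as a-z, and {xy} sequences.
    Escapes, nested sets and set operations are deliberately not handled.
    """
    body = text.strip()
    if body.startswith("[") and body.endswith("]"):
        body = body[1:-1]
    items = set()
    i = 0
    while i < len(body):
        ch = body[i]
        if ch.isspace():
            i += 1
        elif ch == "{":
            # Multi-character sequence, e.g. {ch}
            end = body.index("}", i)
            items.add(body[i + 1:end])
            i = end + 1
        elif i + 2 < len(body) and body[i + 1] == "-":
            # Inclusive range of code points, e.g. a-c
            for cp in range(ord(ch), ord(body[i + 2]) + 1):
                items.add(chr(cp))
            i += 3
        else:
            items.add(ch)
            i += 1
    return items


def main_exemplars(xml_text):
    """Return the main exemplar set (the element without a type attribute)."""
    root = ET.fromstring(xml_text)
    for el in root.findall("./characters/exemplarCharacters"):
        if el.get("type") is None:
            return parse_simple_set(el.text or "")
    return set()


print(sorted(main_exemplars(SAMPLE_LDML)))
```

For production use, a library that implements the full set syntax (for example ICU's UnicodeSet) would be a safer choice than a hand-rolled parser.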

The meanings of the elements are defined in the LDML specification, clause 5.6 Character Elements. Note the description “The element provides optional information about characters that are in common use in the locale, and information that can be helpful in picking resources or data appropriate for the locale, such as when choosing among character encodings that are typically used to transmit data in the language of the locale.” This is a rather strange viewpoint, especially in a Unicode Consortium document, since we can use UTF-8, which covers all languages. But there are other issues where the information about characters used in a language could be useful, like the selection of a font for text, or preliminary checking of input data, or setting parameters for OCR scanning, or defining keyboard setups. These contexts may well require different definitions for the concept “characters used in a language”.
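As an illustration of the "preliminary checking of input data" use case, the sketch below flags letters in a text that fall outside a given per-language set. The ENGLISH_LETTERS set is a hand-written stand-in, not data taken from CLDR, and the function name is hypothetical. What counts as "allowed" would depend on which definition of "characters used in a language" suits the application.

```python
import unicodedata

# Hand-written stand-in for an exemplar set; an assumption, not CLDR data.
ENGLISH_LETTERS = set("abcdefghijklmnopqrstuvwxyz")


def unexpected_letters(text, allowed):
    """Return the lowercase letters in text that are not in the allowed set.

    The text is normalized to NFC first so that a base letter followed by a
    combining mark is compared as one precomposed character where possible.
    Non-letters (digits, punctuation, spaces) are ignored.
    """
    found = set()
    for ch in unicodedata.normalize("NFC", text).lower():
        if ch.isalpha() and ch not in allowed:
            found.add(ch)
    return found


print(unexpected_letters("Señor Smith", ENGLISH_LETTERS))
```

A check like this is only a heuristic: loanwords and proper names routinely contain letters outside a language's core set, so a non-empty result suggests a closer look rather than proving the text is in another language.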
