SaguiItay

Reputation: 2215

Given a language, how to get its alphabet letters

Is there a programmatic way (or an open-source repository) that, given a language (say, as a two-letter ISO code), returns the letters of that language's alphabet?

For example:

console.log(getAlphabet('en'));

outputs:

a b c d ... 

and

console.log(getAlphabet('he'));

outputs:

א ב ג ד ... 

Upvotes: 4

Views: 148

Answers (2)

goose_lake

Reputation: 1480

Expanding on the other answer, which suggests using the Unicode CLDR data, while addressing some of its shortcomings:

Some alphabets include "letters" that are more than one JS character long, and some exemplar entries are sets of one letter with several diacritics (for example, Czech has {ch}, as well as sets such as uúů, in its exemplar characters). For that reason it is more convenient to use the "index" type of exemplar characters, which includes only the characters used for indexing/searching (note that they are also capitalized). In most cases the index set splits off diacritic variants that are different enough in the language's usage to count as separate letters. Where it doesn't (e.g. её in Russian), it's best to use only the first variant. Some non-alphabetic languages, such as Chinese, have a very large set of exemplar characters but are indexed using a much smaller set, which is likely the best one to use.

With all that in mind, here is a function that fetches and parses them. It takes an IETF language tag as input and uses the JSON version of the CLDR data for simplicity:

async function getCharacters(languageTag) {
    const req = await fetch(`https://raw.githubusercontent.com/unicode-org/cldr-json/refs/heads/main/cldr-json/cldr-misc-full/main/${languageTag}/characters.json`);
    if (!req.ok) {
        return null;
    }
    try {
        const data = await req.json();
        // comes in the form of "[A B C {CH} IÍ]"
        const indexCharactersString = data.main?.[languageTag]?.characters?.index;
        if (!indexCharactersString) {
            return null;
        }
        const alphabetArray = indexCharactersString
            // removes []
            .substring(1, indexCharactersString.length - 1)
            // split by space, after we have either single-character letters, multi-character letters in {}, or diacritic sets like ЕЁ
            .split(" ")
            // for {}-encased letters, return everything inside {}; anything else is either a single letter or a diacritic set,
            // so we take its first code point (Array.from keeps surrogate pairs together, unlike substring(0, 1))
            .map(char => char.startsWith("{") ? char.substring(1, char.length - 1) : Array.from(char)[0]);
        return alphabetArray;
    }
    catch(e) {
        return null;
    }
}
getCharacters("cs").then(arr => console.log(arr));

For any language tag that exists in the CLDR data and has an index set, the function resolves to the alphabet array; for tags that don't exist, it resolves to null.
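Because the index set is capitalized, while the example output in the question is lowercase, the result may need lowercasing. A minimal sketch, assuming locale-aware lowercasing is what is wanted; toLowercaseAlphabet is a hypothetical helper name, not something provided by CLDR:

```javascript
// Hypothetical helper: lowercases each letter using the casing rules of the given language.
// Assumes `letters` is an array of strings such as the one getCharacters resolves to.
function toLowercaseAlphabet(letters, languageTag) {
    // toLocaleLowerCase accepts a BCP 47 language tag;
    // letters from scripts without case (e.g. Hebrew) come back unchanged
    return letters.map(letter => letter.toLocaleLowerCase(languageTag));
}

console.log(toLowercaseAlphabet(["A", "B", "C", "CH"], "cs"));
```

With the function above, this could be chained as getCharacters("cs").then(arr => arr && toLowercaseAlphabet(arr, "cs")).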

Upvotes: 3

Heiko Theißen

Reputation: 17487

I don't think that a language always has a well-defined alphabet associated with it. But in the Unicode CLDR standard, the //ldml/characters/exemplarCharacters elements seem to contain a "representative section" of the letters typically used in a given language. The data comes in an open-source repository; see the Hebrew file, for example.

Using an XML parser library, you can write a function that loads the file based on the language code (in the example above, https://raw.githubusercontent.com/unicode-org/cldr/HEAD/common/main/he.xml for language code he) and locates the //ldml/characters/exemplarCharacters element in it.

Below is an example function in client-side JavaScript. It uses a regular expression with the Unicode flag to split the exemplarCharacters into individual letters, even if they are represented by more than one JavaScript character.

fetch("https://raw.githubusercontent.com/unicode-org/cldr/HEAD/common/main/he.xml")
  .then(r => r.text())
  .then(function(xml) {
    var dom = new DOMParser().parseFromString(xml, "text/xml");
    console.log(dom.evaluate("/ldml/characters/exemplarCharacters[1]", dom, undefined, XPathResult.STRING_TYPE).stringValue
    .match(/[^ \[\]]/gu));
  });

Alternatively, you could evaluate /ldml/characters/exemplarCharacters[@type='index'].
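One limitation of the regular expression above is that multi-character entries such as {ch} come back as separate matches, braces included. A minimal sketch of a splitter that keeps such entries together; splitExemplars is a hypothetical name, and the sketch assumes the set is written out as space-separated entries, without ranges or backslash escapes:

```javascript
// Hypothetical helper: splits an exemplar set such as "[a b c {ch} d]" into letters.
function splitExemplars(exemplarSet) {
  // Matches either a {...} group (captured without its braces) or a single code point
  // that is not whitespace, a bracket, or a brace. The u flag keeps surrogate pairs together.
  const entry = /\{([^}]+)\}|[^\s\[\]{}]/gu;
  // m[1] is set only for {...} groups; otherwise fall back to the whole match
  return Array.from(exemplarSet.matchAll(entry), m => m[1] ?? m[0]);
}

console.log(splitExemplars("[a b c {ch} d]"));
```

The same function can be applied to the string value returned by dom.evaluate in the snippet above, in place of the .match call.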

Upvotes: 8
