Reputation: 2215
Is there a programmatic way (or an open-source repository) that, given a language (say, as a 2-letter ISO code), returns the letters of that language's alphabet?
For example:
console.log(getAlphabet('en'));
outputs:
a b c d ...
and
console.log(getAlphabet('he'));
outputs:
א ב ג ד ...
Upvotes: 4
Views: 148
Reputation: 1480
Expanding on the other answer, which suggests using Unicode CLDR data, while addressing some shortcomings:
Some languages have alphabets that include "letters" that are more than one JS character long, and some letters are listed as sets with several diacritics (for example, Czech has {ch}, as well as ií and uúů, in its exemplar characters). It is therefore more convenient to use the "index" type exemplar characters, which include only the characters used for indexing/searching (note that they are also capitalized). In most cases they split out diacritics that are different enough in the language's use to count as a separate letter. For cases where they don't (e.g. её in Russian), it's best to use only the first variant. Some languages with non-alphabetic scripts, like Chinese, have a very large set of exemplar characters but are indexed using a much smaller set, which is likely the best one to use.
With all that in mind, here is a function that fetches and parses them, taking an IETF language tag as input and using the JSON-format CLDR data for simplicity:
async function getCharacters(languageTag) {
  const req = await fetch(`https://raw.githubusercontent.com/unicode-org/cldr-json/refs/heads/main/cldr-json/cldr-misc-full/main/${languageTag}/characters.json`);
  if (!req.ok) {
    return null;
  }
  try {
    const data = await req.json();
    // comes in the form of "[A B C {CH} IÍ]"
    const indexCharactersString = data.main?.[languageTag]?.characters?.index;
    if (!indexCharactersString) {
      return null;
    }
    const alphabetArray = indexCharactersString
      // removes []
      .substring(1, indexCharactersString.length - 1)
      // split by space; each token is a single-character letter, a multi-character letter in {}, or a diacritic set like ЕЁ
      .split(" ")
      // skip empty tokens left by consecutive spaces
      .filter(char => char !== "")
      // for {}-encased letters, return everything inside {}; otherwise take the first code point
      // (Array.from splits by code point, so characters outside the BMP are not cut in half)
      .map(char => char.startsWith("{") ? char.substring(1, char.length - 1) : Array.from(char)[0]);
    return alphabetArray;
  }
  catch (e) {
    return null;
  }
}
getCharacters("cs").then(arr => console.log(arr));
For any language tag that exists in the CLDR, the returned promise resolves to the alphabet array; for tags that don't exist, it resolves to null.
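The parsing step can also be tried on its own, without a network request. Below is a minimal sketch, assuming the index string has the "[A B C {CH} IÍ]" shape mentioned in the code comment above; parseIndexCharacters is a hypothetical name used for illustration, not part of CLDR or any library.

```javascript
// Hypothetical helper (not a CLDR or library API): turns a CLDR "index"
// exemplar string into an array of letters.
function parseIndexCharacters(indexString) {
  return indexString
    .trim()
    // strip the surrounding [ ] if present
    .replace(/^\[|\]$/g, "")
    // split on any run of whitespace
    .split(/\s+/)
    // drop empty tokens (e.g. when the set itself is empty)
    .filter(token => token.length > 0)
    .map(token =>
      token.startsWith("{") && token.endsWith("}")
        ? token.slice(1, -1)     // multi-character letter such as {CH}
        : Array.from(token)[0]   // first code point of a letter or diacritic set
    );
}

console.log(parseIndexCharacters("[A B C {CH} IÍ]"));
// → ["A", "B", "C", "CH", "I"]
```

Keeping the parser separate from the fetch makes it easy to check against sample strings before pointing it at real CLDR data.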
Upvotes: 3
Reputation: 17487
I don't think that a language always has a well-defined alphabet associated with it. But in the Unicode CLDR, the //ldml/characters/exemplarCharacters elements seem to contain a "representative section" of letters typically used in a given language. This data lives in an open-source repository (unicode-org/cldr on GitHub), which has a file for Hebrew, for example.
Using an XML parser library, you can write a function that loads the file based on the language code (in the example above, https://raw.githubusercontent.com/unicode-org/cldr/HEAD/common/main/he.xml for language code he) and locates the //ldml/characters/exemplarCharacters element in it.
Below is an example function in client-side JavaScript. It uses a regular expression with the Unicode flag to split the exemplarCharacters into individual letters, even if they are represented by more than one JavaScript character.
fetch("https://raw.githubusercontent.com/unicode-org/cldr/HEAD/common/main/he.xml")
  .then(r => r.text())
  .then(function(xml) {
    var dom = new DOMParser().parseFromString(xml, "text/xml");
    console.log(dom.evaluate("/ldml/characters/exemplarCharacters[1]", dom, undefined, XPathResult.STRING_TYPE).stringValue
      .match(/[^ \[\]]/gu));
  });
Alternatively, you could evaluate /ldml/characters/exemplarCharacters[@type='index'].
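One caveat with the regular expression above: a multi-character letter written in braces, such as {ch}, would be split into the braces and the individual characters. A possible variant that keeps such groups together is sketched below; splitExemplarCharacters is a hypothetical name, and the sketch assumes the set contains only plain characters and {...} groups. UnicodeSet syntax also allows ranges and backslash escapes, which it does not handle.

```javascript
// Hypothetical helper: tokenizes an exemplarCharacters string into letters.
function splitExemplarCharacters(exemplar) {
  // Match either a {...} group (capturing its contents without the braces)
  // or a single code point that is not whitespace, a bracket, or a brace.
  const tokenPattern = /\{([^}]+)\}|[^\s\[\]{}]/gu;
  // m[1] is set only for {...} groups; otherwise fall back to the whole match.
  return Array.from(exemplar.matchAll(tokenPattern), m => m[1] ?? m[0]);
}

console.log(splitExemplarCharacters("[a b c {ch} d]"));
// → ["a", "b", "c", "ch", "d"]
```

The function takes a plain string, so it could be applied to the stringValue returned by dom.evaluate in place of the .match(...) call.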
Upvotes: 8