Import all letters of an alphabet of a certain language

Could it be possible to import all the possible letters (lowercase, uppercase, etc.) in an alphabet in a certain language (Turkish, Polish, Russian, etc.) as a python list? Is there a certain module to do that?

Upvotes: 3

Views: 1799

Answers (3)

Andj

Reputation: 1447

An old question, but one that deserves to be explored in more detail. The core complexity is that the characters a language uses and the letters in its alphabet may be two different things. There is the added complexity that a letter may consist of more than one character.
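To see this multi-character complexity concretely, here is a small sketch using only the standard library: the Czech letter "ch" is a digraph of two code points, and the Dinka letter "ɛ̈" is a base character plus a combining mark.

```python
import unicodedata

# The Czech letter "ch" is a digraph: one letter, two code points.
ch = "ch"
print(len(ch))  # 2

# The Dinka letter "ɛ̈" is a base letter plus a combining mark:
# still one letter to a reader, but two code points to Python.
eps = "\u025B\u0308"  # LATIN SMALL LETTER OPEN E + COMBINING DIAERESIS
print(len(eps))  # 2
print([unicodedata.name(c) for c in eps])
```

So iterating over a string character by character does not reliably iterate over letters of the alphabet.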

CLDR includes exemplar data in its locales. Exemplar data always contains the main exemplars, and may also contain auxiliary and index exemplars. Main exemplars are the characters needed to write the language, stored in lowercase only. Auxiliary exemplars are additional characters that may be found in text (also stored in lowercase). Index exemplars are the letters used for indexing or structuring dictionary data (these are uppercase).

I will show two approaches:

  1. Using PyICU
  2. Retrieving exemplar data from CLDR or SLDR.

PyICU:

The icu.LocaleData class exposes exemplar data via the getExemplarSet method.

The syntax would be:

icu.LocaleData(locale_id).getExemplarSet(case_option, exemplar_type)

So:

import icu

def get_exemplars(localeID, extype='main', option=0):
    # Bitmask for options to apply to the exemplar pattern:
    #    0 -> retrieve the exemplar set as it is defined in the locale data
    #    2 -> retrieve a case-folded exemplar set (icu.USET_CASE_INSENSITIVE)
    #    4 -> retrieve a case-mapped exemplar set (icu.USET_ADD_CASE_MAPPINGS)
    option = option if option in (0, 2, 4) else 0
    # Type enumerations (icu.ULocaleDataExemplarSetType):
    #   ES_STANDARD  -> 0
    #   ES_AUXILIARY -> 1
    #   ES_INDEX     -> 2
    types = {'main': 0, 'auxiliary': 1, 'index': 2}
    if extype.lower() not in types:
        raise ValueError(f'Unknown exemplar type: {extype}')
    estype = types[extype.lower()]
    localeID = localeID.replace('-', '_')
    if localeID not in icu.Locale.getAvailableLocales():
        raise ValueError(f'Specified locale not available in icu4c {icu.ICU_VERSION}')
    if localeID in icu.Collator.getAvailableLocales():
        collator = icu.Collator.createInstance(icu.Locale(localeID))
    else:
        collator = icu.Collator.createInstance(icu.Locale.getRoot())
    return sorted(icu.LocaleData(localeID).getExemplarSet(option, estype),
                  key=collator.getSortKey)

For the main/standard exemplars for the Czech language:

get_exemplars('cs')
# ['a', 'á', 'b', 'c', 'č', 'd', 'ď', 'e', 'é', 'ě', 'f', 'g', 'h', 'ch', 
# 'i', 'í', 'j', 'k', 'l', 'm', 'n', 'ň', 'o', 'ó', 'p', 'q', 'r', 'ř', 
# 's', 'š', 't', 'ť', 'u', 'ú', 'ů', 'v', 'w', 'x', 'y', 'ý', 'z', 'ž']

Note that the Czech letter ch is present as a digraph; it doesn't get split into separate c and h entries.
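By contrast, a naive character-level approach would lose the digraph. A small sketch (using a hand-picked Czech word, not data from any library):

```python
# A naive set of characters splits the digraph "ch" into "c" and "h",
# losing the fact that Czech treats it as a single letter.
naive = set("chleba")   # Czech for "bread"
print(sorted(naive))    # ['a', 'b', 'c', 'e', 'h', 'l'] -- no 'ch'

# The exemplar list keeps "ch" as one entry, so membership tests work
# at the letter level rather than the code-point level.
exemplars = ['c', 'ch', 'h']
print('ch' in exemplars)  # True
```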

To get the auxiliary characters:

get_exemplars('cs', extype='auxiliary')
# ['à', 'ă', 'â', 'å', 'ä', 'ã', 'ā', 'æ', 'ç', 'è', 'ĕ', 'ê', 'ë', 'ē', 
# 'ì', 'ĭ', 'î', 'ï', 'ī', 'ľ', 'ł', 'ñ', 'ò', 'ŏ', 'ô', 'ö', 'ø', 'ō', 
# 'œ', 'ŕ', 'ù', 'ŭ', 'û', 'ü', 'ū', 'ÿ']

Auxiliary characters are usually from foreign (non-Czech) words and names that can be found in Czech text.

The index exemplars:

get_exemplars('cs', extype='index')
# ['A', 'B', 'C', 'Č', 'D', 'E', 'F', 'G', 'H', 'CH', 'I', 'J', 'K', 'L', 
# 'M', 'N', 'O', 'P', 'Q', 'R', 'Ř', 'S', 'Š', 'T', 'U', 'V', 'W', 'X', 
# 'Y', 'Z', 'Ž']

To illustrate the set options, I'll use German as an example, since its case mapping and case folding are more complex:

get_exemplars('de')
# ['a', 'ä', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 
# 'n', 'o', 'ö', 'p', 'q', 'r', 's', 'ß', 't', 'u', 'ü', 'v', 'w', 'x', 
# 'y', 'z']

For the case-folded exemplars:

get_exemplars('de', option=2)
# ['a', 'A', 'ä', 'Ä', 'b', 'B', 'c', 'C', 'd', 'D', 'e', 'E', 'f', 'F', 
# 'g', 'G', 'h', 'H', 'i', 'I', 'j', 'J', 'k', 'K', 'K', 'l', 'L', 'm', 
# 'M', 'n', 'N', 'o', 'O', 'ö', 'Ö', 'p', 'P', 'q', 'Q', 'r', 'R', 
# 's', 'S', 'ſ', 'ss', 'ß', 'ẞ', 't', 'T', 'u', 'U', 'ü', 'Ü', 'v', 'V', 
# 'w', 'W', 'x', 'X', 'y', 'Y', 'z', 'Z']

And adding case mapping:

get_exemplars('de', option=4)
# ['a', 'A', 'ä', 'Ä', 'b', 'B', 'c', 'C', 'd', 'D', 'e', 'E', 'f', 'F', 
# 'g', 'G', 'h', 'H', 'i', 'I', 'j', 'J', 'k', 'K', 'l', 'L', 'm', 'M', 
# 'n', 'N', 'o', 'O', 'ö', 'Ö', 'p', 'P', 'q', 'Q', 'r', 'R', 
# 's', 'S', 'ss', 'Ss', 'SS', 'ß', 't', 'T', 'u', 'U', 'ü', 'Ü', 'v', 'V', 
# 'w', 'W', 'x', 'X', 'y', 'Y', 'z', 'Z']

You could also use the constants rather than integer values:

get_exemplars('de', option=icu.USET_ADD_CASE_MAPPINGS)
# ['a', 'A', 'ä', 'Ä', 'b', 'B', 'c', 'C', 'd', 'D', 'e', 'E', 'f', 'F', 
# 'g', 'G', 'h', 'H', 'i', 'I', 'j', 'J', 'k', 'K', 'l', 'L', 'm', 'M', 
# 'n', 'N', 'o', 'O', 'ö', 'Ö', 'p', 'P', 'q', 'Q', 'r', 'R', 
# 's', 'S', 'ss', 'Ss', 'SS', 'ß', 't', 'T', 'u', 'U', 'ü', 'Ü', 'v', 'V', 
# 'w', 'W', 'x', 'X', 'y', 'Y', 'z', 'Z']

Using CLDR and SLDR

It is possible to retrieve the exemplar characters directly from the Common Locale Data Repository (CLDR). This is less flexible, since you'd need to add case mapping and case-insensitive additions in your own code. The data is stored in XML files using the Locale Data Markup Language (LDML).

Alternatively, LDML files can be retrieved from the SIL Locale Data Repository (SLDR).
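Since CLDR stores main exemplars in lowercase, one way to approximate the case-mapped set yourself is Python's built-in `str.upper()`. This is only a naive sketch: it will not reproduce ICU's fuller closure (e.g. it won't add U+017F LATIN SMALL LETTER LONG S for 's').

```python
def add_case_mappings(exemplars):
    """Naively add uppercase counterparts to a lowercase exemplar list.

    Approximates icu.USET_ADD_CASE_MAPPINGS with str.upper(); it misses
    the extra characters ICU's closure adds (such as the long s for 's').
    """
    result = set(exemplars)
    for ex in exemplars:
        result.add(ex.upper())
    return sorted(result)

# 'ß'.upper() is 'SS' in Python, mirroring German case mapping.
print(add_case_mappings(['a', 'ä', 'ß']))
```

Note that `sorted()` here uses code-point order; for language-aware ordering you would still want an ICU collator.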

import icu
import requests
import xml.etree.ElementTree as ET

def get_CLDR_exemplars(localeID, extype='main'):
    extype = extype.lower()
    if extype not in ('main', 'auxiliary', 'index'):
        raise ValueError(f'Unknown exemplar type: {extype}')
    localeID = localeID.replace('-', '_')
    if localeID in icu.Collator.getAvailableLocales():
        locale = icu.Locale(localeID)
    else:
        locale = icu.Locale.getRoot()
    collator = icu.Collator.createInstance(locale)
    url = rf'https://raw.githubusercontent.com/unicode-org/cldr/main/common/main/{localeID}.xml'
    response = requests.get(url)
    if not response.ok:
        # Fall back to the SLDR repository, which shards files by initial letter.
        initial_letter = localeID[0]
        url = rf'https://raw.githubusercontent.com/silnrsi/sldr/refs/heads/master/sldr/{initial_letter}/{localeID}.xml'
        response = requests.get(url)
        if not response.ok:
            raise ValueError(f'Specified locale ({localeID}) not available in CLDR or SLDR')
    tree = ET.fromstring(response.text)
    for element in tree.findall('characters/exemplarCharacters'):
        ex_type = element.attrib.get('type', 'main')
        # '↑↑↑' marks inherited (aliased) data in LDML files.
        if ex_type == extype and element.text and element.text != '↑↑↑':
            return sorted(icu.UnicodeSet(element.text), key=collator.getSortKey)
    raise ValueError(f'{extype.title()} exemplar data unavailable for the specified locale ({localeID}).')

The code retrieves the LDML file from the CLDR GitHub repository. If the locale isn't available in CLDR, it falls back to the SLDR GitHub repository.

get_CLDR_exemplars('cs')
# ['a', 'á', 'b', 'c', 'č', 'd', 'ď', 'e', 'é', 'ě', 'f', 'g', 'h', 'ch', 
# 'i', 'í', 'j', 'k', 'l', 'm', 'n', 'ň', 'o', 'ó', 'p', 'q', 'r', 'ř', 
# 's', 'š', 't', 'ť', 'u', 'ú', 'ů', 'v', 'w', 'x', 'y', 'ý', 'z', 'ž']

get_CLDR_exemplars('cs', 'auxiliary')
# ['à', 'ă', 'â', 'å', 'ä', 'ã', 'ā', 'æ', 'ç', 'è', 'ĕ', 'ê', 'ë', 'ē', 
# 'ì', 'ĭ', 'î', 'ï', 'ī', 'ľ', 'ł', 'ñ', 'ò', 'ŏ', 'ô', 'ö', 'ø', 'ō', 
# 'œ', 'ŕ', 'ù', 'ŭ', 'û', 'ü', 'ū', 'ÿ']

get_CLDR_exemplars('din')
# ['a', 'ä', 'b', 'c', 'd', 'dh', 'e', 'ë', 'ɛ', 'ɛ̈', 'g', 'ɣ', 'i', 'ï', 
# 'j', 'k', 'l', 'm', 'n', 'nh', 'ny', 'ŋ', 'o', 'ö', 'ɔ', 'ɔ̈', 'p', 'r', 
# 't', 'th', 'u', 'w', 'y']

get_CLDR_exemplars('din', 'index')
# ['A', 'B', 'C', 'D', 'DH', 'E', 'Ɛ', 'G', 'Ɣ', 'I', 'J', 'K', 'L', 'M', 
# 'N', 'NH', 'NY', 'Ŋ', 'O', 'Ɔ', 'P', 'R', 'T', 'TH', 'U', 'W', 'Y']

Upvotes: 1

AndrewQ

Reputation: 420

If I have understood your question, you want a list of all letters in an alphabet. A possible solution:

  • get a string containing the full alphabet you need
  • use set() to transform the string into a collection of unique, unordered elements.

Then you can use the collection to do a lot of things, as explained in docs.python.org, section 5.4:

a = set('abracadabra')
b = set('alacazam')
a                                  # unique letters in a
   {'a', 'r', 'b', 'c', 'd'}
a - b                              # letters in a but not in b
   {'r', 'd', 'b'}
a | b                              # letters in a or b or both
   {'a', 'c', 'r', 'd', 'b', 'm', 'z', 'l'}
a & b                              # letters in both a and b
   {'a', 'c'}
a ^ b                              # letters in a or b but not both
   {'r', 'd', 'b', 'm', 'z', 'l'}
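For instance, starting from a hand-typed Turkish lowercase alphabet (the string here is written out manually, not pulled from any library):

```python
# The 29 letters of the Turkish lowercase alphabet, typed out by hand.
turkish = "abcçdefgğhıijklmnoöprsştuüvyz"

letters = sorted(set(turkish))
print(len(letters))  # 29 unique letters
print(letters)
```

Keep in mind that sorted() orders by Unicode code point, so 'ç', 'ğ', 'ı' and friends will not land where Turkish collation would put them, and this approach cannot represent multi-character letters like Czech "ch".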

Upvotes: -1

sophros

Reputation: 16728

Your question ties into a larger problem: how alphabets of particular languages are stored in a computer, how they are represented, and (eventually) how they can be retrieved in Python.

I suggest you read:

The short answer is "yes". But it depends on what you actually consider the alphabet of a language (e.g. some languages have specific punctuation characters; do you count them as part of the alphabet in your application?). What do you need it for? If it is about language detection, there is a duplicate question. Your question is generic; without details and (ideally) a snippet, it will be difficult to answer satisfactorily.

Upvotes: 1
