Eli

Reputation: 38949

Detecting which alphabet characters belong to in Python

Is there a library or other simple way to detect which alphabet characters belong to in Python? I know I can use Unicode code ranges for this, but if there's already a built-in way or a library that provides the mappings, I'd rather not reinvent the wheel.

Note: I'm asking about alphabet, not language. Both "hello" and "hola" would map to the Latin alphabet, whereas "Поиск" would map to Cyrillic.

Upvotes: 5

Views: 1684

Answers (3)

Andj

Reputation: 1374

An alternative approach.

There are no core Python methods that provide information on scripts. It is necessary to look at alternatives. I'd suggest either unicodedataplus or PyICU.

unicodedataplus is a drop-in replacement for unicodedata that is kept up to date and provides additional methods, two of which are relevant here: unicodedataplus.script() and unicodedataplus.script_extensions().

The version of Unicode supported by the Python version you are using can be found using:

import unicodedata
print(unicodedata.unidata_version)
# 15.0.0

Python 3.12 uses Unicode version 15.0. It is important to realise that the current versions of unicodedataplus, and of ICU4C (which PyICU wraps), may not match the Unicode version used by your Python install, so scripts not supported by your version of Python may be reported as available.
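As a quick sanity check, you can compare the Unicode version baked into Python's unicodedata with the one PyICU reports (the try/except lets the snippet degrade gracefully if PyICU isn't installed):

```python
import unicodedata

# Unicode version compiled into Python's unicodedata module
print(unicodedata.unidata_version)  # e.g. '15.0.0' on Python 3.12

try:
    import icu  # PyICU
    # Unicode and ICU4C versions behind PyICU; these may be newer
    # than the version Python itself supports
    print(icu.UNICODE_VERSION)
    print(icu.ICU_VERSION)
except ImportError:
    print("PyICU is not installed")
```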

It is possible to install a specific version of unicodedataplus; the package's version number matches the Unicode version it supports. Refer to the unicodedataplus release history for available versions. So for Python 3.12, to match Python's Unicode version, we'd install version 15.0.0.post2:

pip install unicodedataplus==15.0.0.post2

The script of a specific character

All Unicode characters are assigned to a script, and unicodedataplus.script() will return the name of that script:

import unicodedataplus as ud
print(ud.script('𖫐'))
# Bassa_Vah

We could also loop through a string and get all scripts in a string:

example = "La sottise, l'erreur, le péché, la lésine, occupent nos esprits et travaillent nos corps"
detected_scripts = set()
for char in example:
    detected_scripts.add(ud.script(char))
print(detected_scripts)
# {'Latin', 'Common'}

For a string longer than one word, the code will usually return more than one script. In Unicode, and likewise in ISO 15924, there are three special script values:

  • Common: characters used in more than one script, such as whitespace, punctuation and the digits 0-9.
  • Inherited: characters such as ZWJ and ZWNJ, and combining marks that are used in more than one script and inherit their script value from their base character.
  • Unknown: Surrogate codepoints, Private Use Area codepoints, noncharacters and reserved codepoints.

We could rewrite the code above to:

detected_scripts = set()
for char in example:
    s = ud.script(char)
    if s not in ["Common", "Inherited", "Unknown"]:
        detected_scripts.add(s)
print(detected_scripts)
# {'Latin'}

Supported scripts

It is also possible to get a list of scripts supported by Python:

scripts = set()
for i in range(0x10FFFF + 1):
    s = ud.script(chr(i))
    if s not in ["Common", "Inherited", "Unknown"]:
        scripts.add(s)
available_scripts = sorted(scripts)
print(available_scripts)

Characters in a script

Identifying what characters are in a script can be problematic. Available approaches and data will not report what Common and Inherited script characters are required in any specific script. The best we can do is to identify characters by script property.

Additionally, the characters in many scripts are non-contiguous, so constructing ranges would require processing Unicode data and building a dataset, one that would need to be updated with each Unicode release.

Assuming the Unicode versions used by Python and ICU4C match, we can get a list of the characters in a script, for use in other functions. It is possible to create sets of characters using icu.UnicodeSet(), using either POSIX or Perl notation with the script name or script code, in short or long form:

  • [:Arabic:]
  • [:Arab:]
  • [\p{Arabic}]
  • [\p{Arab}]
  • [\p{sc=Arab}]
  • [\p{sc=Arabic}]
  • [\p{Script=Arabic}]
  • [\p{Script=Arab}]
  • [\p{sc:Arab}]
  • [\p{sc:Arabic}]
  • [\p{Script:Arabic}]
  • [\p{Script:Arab}]
import icu

script = "Latn"
script_characters = list(icu.UnicodeSet(f'[\\p{{{script}}}]'))

Script extensions

unicodedataplus and PyICU also allow you to check for script extensions: some characters are used in more than one script, and the script-extensions data records this for some of them.

print(ud.script("\u064E"))
# Inherited
print(ud.script_extensions("\u064E"))
# ['Arab', 'Syrc']

N.B. unicodedataplus.script_extensions() returns a list of ISO 15924 four-letter codes rather than script names.

In the above example, the fatha (U+064E) belongs to the Inherited script, but has script extensions for the Arabic and Syriac scripts.

Script extension data is incomplete; there are efforts underway within the Script Ad Hoc group (SAH) and the Unicode Technical Committee (UTC) to expand it.

Upvotes: 0

GaspardP

Reputation: 4822

The closest I could find to solve this was https://pypi.org/project/uniscripts/, which has not been updated in years but takes the right approach by pulling the scripts from the Unicode standard.

I updated uniscripts to Unicode 15.1 and submitted a merge request to the package maintainer. Meanwhile, you can use it from my repository:

pip install git+https://github.com/gaspardpetit/uniscripts.git

and then:

>>> from uniscripts import is_script, Scripts
>>> is_script(u"ελληνικά means greek", Scripts.LATIN)
False

>>> is_script(u"ελληνικά", Scripts.GREEK)
True

>>> is_script(u"гага", Scripts.CYRILLIC)
True

alphabet-detector was unreliable for me, as it returns the first word of the character's Unicode name, which is often the script name, but not always. For example:

>>> from alphabet_detector import AlphabetDetector
>>> ad = AlphabetDetector()
>>> ad.detect_alphabet("𐲌")
{'OLD'}

>>> ad.detect_alphabet("º")
{'MASCULINE'}

uniscripts on the other hand correctly returns:

>>> from uniscripts import get_scripts
>>> get_scripts("𐲌")
{'Old_Hungarian'}

>>> get_scripts("º")
{'Latin', 'Common'}

Upvotes: 0

Eli

Reputation: 38949

Python's unicodedata is hugely helpful here, as is this question/answer.

I couldn't find any simple way of detecting an alphabet without writing a whole module, and I figured I'd run into a lot of corner cases, so I wrote a library. The GitHub page is here. With that, you can just:

pip install alphabet-detector

and then use it directly:

from alphabet_detector import AlphabetDetector
ad = AlphabetDetector()

ad.only_alphabet_chars(u"ελληνικά means greek", "LATIN") #False
ad.only_alphabet_chars(u"ελληνικά", "GREEK") #True
ad.only_alphabet_chars(u"frappé", "LATIN") #True
ad.only_alphabet_chars(u"hôtel lœwe", "LATIN") #True
ad.only_alphabet_chars(u"123 ångstrom ð áß", "LATIN") #True
ad.only_alphabet_chars(u"russian: гага", "LATIN") #False
ad.only_alphabet_chars(u"гага", "CYRILLIC") #True

I also wrote a few convenience methods for major languages:

ad.is_cyrillic(u"гага") #True  
ad.is_latin(u"howdy") #True
ad.is_cjk(u"hi") #False
ad.is_cjk(u'汉字') #True

Upvotes: 4
