Reputation: 38949
Is there a library or other simple way to detect which alphabet characters belong to in Python? I know I can use unicode code ranges for this, but if there's already a built-in way or a library or some such that provides the mappings, I'd rather not reinvent the wheel.
Note: I'm asking about alphabet not language. Both "hello" and "hola" would map to Latin alphabet, whereas "Поиск" would map to Cyrillic.
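For context, the code-range approach mentioned above might look like the following minimal sketch. It covers only the main Cyrillic block, and the helper name is illustrative; a real solution would need many ranges per script, which is exactly the wheel-reinvention the question wants to avoid.

```python
def is_cyrillic(text):
    """Crude check: every letter falls in the main Cyrillic block (U+0400-U+04FF).

    Only illustrates the code-range approach; many scripts span
    several non-adjacent blocks, so this is not a general solution.
    """
    letters = [ch for ch in text if ch.isalpha()]
    return bool(letters) and all('\u0400' <= ch <= '\u04FF' for ch in letters)

print(is_cyrillic("Поиск"))  # True
print(is_cyrillic("hola"))   # False
```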
Upvotes: 5
Views: 1684
Reputation: 1374
An alternative approach.
There are no core Python methods that provide information on scripts, so it is necessary to look at alternatives. I'd suggest either unicodedataplus or PyICU.
unicodedataplus is a drop-in replacement for unicodedata that is kept up to date and provides additional methods, two of which are relevant here: unicodedataplus.script() and unicodedataplus.script_extensions().
The version of Unicode supported by the Python version you are using can be found using:
import unicodedata
print(unicodedata.unidata_version)
# 15.0.0
Python 3.12 uses Unicode version 15.0. It is important to realise that the current versions of unicodedataplus, and of icu4c (which is used by PyICU), may not match the version of Unicode used by your Python install, so scripts not supported by your version of Python may be reported as available.
It is possible to install a specific version of unicodedataplus; the package's version number matches the Unicode version it supports. Refer to the unicodedataplus release history for available versions. So for Python 3.12, if we want to match Python's Unicode version, we'd install version 15.0.0.post2 of unicodedataplus:
pip install unicodedataplus==15.0.0.post2
All Unicode characters are assigned to a script, and unicodedataplus.script() will return the name of that script:
import unicodedataplus as ud
print(ud.script('𖫐'))
# Bassa_Vah
We can also loop through a string and collect all the scripts it contains:
example = "La sottise, l'erreur, le péché, la lésine, occupent nos esprits et travaillent nos corps"
detected_scripts = set()
for char in example:
    detected_scripts.add(ud.script(char))
print(detected_scripts)
# {'Latin', 'Common'}
For a string longer than one word, the code will usually return more than one script. In Unicode, and likewise in ISO 15924, there are three special script values: Common (characters used across many scripts, such as punctuation and digits), Inherited (characters that take on the script of the preceding character, such as combining marks), and Unknown (unassigned and private-use code points).
We could rewrite the code above to:
detected_scripts = set()
for char in example:
    s = ud.script(char)
    if s not in ["Common", "Inherited", "Unknown"]:
        detected_scripts.add(s)
print(detected_scripts)
# {'Latin'}
It is also possible to get a list of scripts supported by Python:
scripts = set()
for i in range(int('10FFFF', 16) + 1):
    s = ud.script(chr(i))
    if s not in ["Common", "Inherited", "Unknown"]:
        scripts.add(s)
available_scripts = sorted(scripts)
print(available_scripts)
Identifying which characters are in a script can be problematic. Available approaches and data will not report which Common and Inherited characters are required by any specific script. The best we can do is to identify characters by their script property.
Additionally, the characters in many scripts are non-contiguous, so constructing ranges would require processing Unicode data and building a dataset, data that would need to be updated with each Unicode release.
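The non-contiguity is easy to see with the standard library alone: Latin-script letters are scattered across many blocks. A quick sketch using unicodedata character names (the specific code points are just examples I've picked):

```python
import unicodedata

# Three Latin-script letters from widely separated blocks:
# Basic Latin, Latin Extended-A, and Latin Extended-C.
for ch in ["A", "\u0101", "\u2C6F"]:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
# U+0041 LATIN CAPITAL LETTER A
# U+0101 LATIN SMALL LETTER A WITH MACRON
# U+2C6F LATIN CAPITAL LETTER TURNED A
```

Any range-based approach would need to stitch together all such blocks, per script, per Unicode release.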
Assuming the Unicode versions used by Python and icu4c match, we can get a list of characters in a script for use in other functions. It is possible to create sets of characters using icu.UnicodeSet(), using either POSIX or Perl notation, with the name of the script or its script code, in short or long form:
[:Arabic:]
[:Arab:]
[\p{Arabic}]
[\p{Arab}]
[\p{sc=Arab}]
[\p{sc=Arabic}]
[\p{Script=Arabic}]
[\p{Script=Arab}]
[\p{sc:Arab}]
[\p{sc:Arabic}]
[\p{Script:Arabic}]
[\p{Script:Arab}]
import icu

script = "Latn"
script_characters = list(icu.UnicodeSet(f'\\p{{{script}}}'))
unicodedataplus and PyICU also allow you to check for script extensions: some characters are used in more than one script, and the script extensions property records data on some of these characters.
print(ud.script("\u064E"))
# Inherited
print(ud.script_extensions("\u064E"))
# ['Arab', 'Syrc']
N.B. unicodedataplus.script_extensions() returns a list of ISO 15924 four-letter codes, rather than the script name.
In the example above, the fatha belongs to the Inherited script, but has script extensions for the Arabic and Syriac scripts.
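Note that the standard library's character names can be misleading here: the fatha's name starts with ARABIC even though its script property is Inherited, so heuristics based on the first word of a character's name will disagree with the script property for characters like this. A stdlib-only illustration:

```python
import unicodedata

# U+064E ARABIC FATHA: the name says ARABIC, but its Unicode script
# property is Inherited (it takes the script of its base character).
print(unicodedata.name("\u064E"))  # ARABIC FATHA
```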
Script extension data is incomplete; there are efforts underway within the SAH (Script Ad Hoc group) and the UTC (Unicode Technical Committee) to expand the script extension data.
Upvotes: 0
Reputation: 4822
The closest I could find to solve this was https://pypi.org/project/uniscripts/, which has not been updated in years but takes the right approach by pulling the script data from the Unicode standard.
I updated uniscripts to Unicode 15.1 and submitted a merge request to the package maintainer. Meanwhile, you can use it from my repository:
pip install git+https://github.com/gaspardpetit/uniscripts.git
and then:
>>> from uniscripts import is_script, Scripts
>>> is_script(u"ελληνικά means greek", Scripts.LATIN)
False
>>> is_script(u"ελληνικά", Scripts.GREEK)
True
>>> is_script(u"гага", Scripts.CYRILLIC)
True
alphabet-detector was unreliable for me, as it returns the first word of the character name, which is often the script name, but not always. For example:
>>> from alphabet_detector import AlphabetDetector
>>> ad = AlphabetDetector()
>>> ad.detect_alphabet("𐲌")
{'OLD'}
>>> ad.detect_alphabet("º")
{'MASCULINE'}
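The first-word behaviour can be reproduced with just the standard library, which makes the failure mode easy to see (this is an illustration of the heuristic, not alphabet-detector's exact code):

```python
import unicodedata

# First word of the Unicode character name: often the script name,
# but "º" shows where the heuristic breaks down.
for ch in ["º", "a"]:
    name = unicodedata.name(ch)
    print(f"{ch!r}: {name!r} -> first word {name.split()[0]!r}")
# 'º': 'MASCULINE ORDINAL INDICATOR' -> first word 'MASCULINE'
# 'a': 'LATIN SMALL LETTER A' -> first word 'LATIN'
```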
uniscripts, on the other hand, correctly returns:
>>> from uniscripts import get_scripts
>>> get_scripts("𐲌")
{'Old_Hungarian'}
>>> get_scripts("º")
{'Latin', 'Common'}
Upvotes: 0
Reputation: 38949
Python's unicodedata is hugely helpful here, as is this question/answer.
I couldn't find any simple way of detecting an alphabet without writing a whole module, and I figured I'd run into a lot of corner cases, so I wrote a library. The GitHub page is here. With that, you can just:
pip install alphabet-detector
and then use it directly:
from alphabet_detector import AlphabetDetector
ad = AlphabetDetector()
ad.only_alphabet_chars(u"ελληνικά means greek", "LATIN") #False
ad.only_alphabet_chars(u"ελληνικά", "GREEK") #True
ad.only_alphabet_chars(u"frappé", "LATIN") #True
ad.only_alphabet_chars(u"hôtel lœwe", "LATIN") #True
ad.only_alphabet_chars(u"123 ångstrom ð áß", "LATIN") #True
ad.only_alphabet_chars(u"russian: гага", "LATIN") #False
ad.only_alphabet_chars(u"гага", "CYRILLIC") #True
I also wrote a few convenience methods for major languages:
ad.is_cyrillic(u"гага") #True
ad.is_latin(u"howdy") #True
ad.is_cjk(u"hi") #False
ad.is_cjk(u'汉字') #True
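For comparison, here is a rough stdlib-only sketch of an is_cyrillic-style check built on the first word of each character's Unicode name. The helper name is mine, and name-based checks have the edge cases noted in other answers, so treat it as an approximation rather than the library's implementation:

```python
import unicodedata

def only_name_prefix(text, prefix):
    """True if every letter's Unicode character name starts with prefix.

    Name prefixes usually match the script name, but not always,
    so this is only a rough approximation of a script check.
    """
    letters = [ch for ch in text if ch.isalpha()]
    return bool(letters) and all(
        unicodedata.name(ch, "").startswith(prefix) for ch in letters
    )

print(only_name_prefix("гага", "CYRILLIC"))   # True
print(only_name_prefix("howdy", "CYRILLIC"))  # False
```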
Upvotes: 4