Reputation: 165
I've recently been reading about casefold and case-insensitive string comparison. I've read that the MSDN recommendation is to use InvariantCulture and to avoid toLowercase. However, from what I have read, casefold is like a more aggressive toLowercase. My question is: should I use casefold in Python, or is there a more Pythonic standard to use instead? Also, does casefold pass the Turkey Test?
Upvotes: 7
Views: 5444
Reputation: 1364
An old question, but I'll add an answer for future reference, and flesh out the discussion of case folding and insensitive matching in Python and Unicode.
Unicode defines two sets of operations. The first is case mapping, for which Unicode defines three cases: lowercase, uppercase, and titlecase. These are string transformations that change text from one case to another.
The second is case folding. Case folding is designed to remove case distinctions before comparing or matching strings. This is different from the comparison operations between two strings defined in collation.
If you need to match or compare strings case-insensitively, use case folding.
If you want to transform text, or to do case-sensitive string comparison, use case mapping operations.
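To make the difference concrete, here is a minimal sketch using only built-in str methods. The German sharp s is my own choice of example; it is not taken from the question.

```python
# Case mapping vs. case folding on the German sharp s.
a = "Stra\u00dfe"   # "Straße"
b = "STRASSE"

# Case mapping: lower() keeps the sharp s, so the strings still differ.
mapped_equal = a.lower() == b.lower()

# Case folding: casefold() folds the sharp s to "ss", so the strings match.
folded_equal = a.casefold() == b.casefold()

print(mapped_equal)  # False
print(folded_equal)  # True
```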
The key source of data on case folding is the CaseFolding.txt file in the UCD.
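As a rough illustration of what that file contains, the sketch below parses a handful of entries written in the CaseFolding.txt layout (code; status; mapping; # name). The excerpt is typed in by hand for illustration and should be checked against the real file, and parse_casefolding is a hypothetical helper name, not part of any library.

```python
# A few entries in the CaseFolding.txt layout:
#   <code>; <status>; <mapping>; # <name>
# Hand-copied for illustration; verify against the real UCD file.
EXCERPT = """\
0049; C; 0069; # LATIN CAPITAL LETTER I
0049; T; 0131; # LATIN CAPITAL LETTER I
00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S
0130; F; 0069 0307; # LATIN CAPITAL LETTER I WITH DOT ABOVE
0130; T; 0069; # LATIN CAPITAL LETTER I WITH DOT ABOVE
1E9E; F; 0073 0073; # LATIN CAPITAL LETTER SHARP S
1E9E; S; 00DF; # LATIN CAPITAL LETTER SHARP S
"""


def parse_casefolding(text):
    """Return {status: {source_char: folded_string}} for the given lines."""
    tables = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not line:
            continue
        code, status, mapping = [f.strip() for f in line.split(";")[:3]]
        source = chr(int(code, 16))
        target = "".join(chr(int(cp, 16)) for cp in mapping.split())
        tables.setdefault(status, {})[source] = target
    return tables


tables = parse_casefolding(EXCERPT)
simple = {**tables["C"], **tables["S"]}  # one character -> one character
full = {**tables["C"], **tables["F"]}    # one character -> one or more

print(sorted(tables))  # ['C', 'F', 'S', 'T']
print(full["\u00df"])  # ss
```

The status column is what separates the folding variants: C entries are shared, S and F are the simple and full alternatives, and T holds the Turkic exceptions.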
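Three methods are defined for case folding:
Simple case folding: each character is mapped to a single character. Simple case folding uses the mappings with status C and S.
Full case folding: this is what str.casefold uses. Individual characters can be mapped to a sequence of characters. Full case folding uses the mappings with status C and F.
Turkic case folding: an optional variant of either of the above that applies the mappings with status T for dotted and dotless I.
Case folding is not a string operation that is sensitive to locales or languages, except for the option to use the Turkic exceptions. It is also important to note that case insensitivity using str.casefold differs from case insensitivity in the re module.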
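A small sketch of that last point, using only the standard library. The sharp s example is my own; the point is that re.IGNORECASE compares character by character and so cannot apply a one-to-many folding.

```python
import re

# str.casefold applies full case folding: the sharp s folds to "ss".
folded_match = "stra\u00dfe".casefold() == "STRASSE".casefold()

# re.IGNORECASE matches one character against one character, so the
# single sharp s in the pattern cannot match the two characters "SS".
regex_match = re.fullmatch("stra\u00dfe", "STRASSE", flags=re.IGNORECASE)

print(folded_match)  # True
print(regex_match)   # None
```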
Casefolding is a building block to other matching algorithms, including canonical caseless matching, compatibility caseless matching, and identifier matching.
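For instance, compatibility caseless matching as defined in the Unicode Standard wraps case folding in NFD and NFKD normalization steps. The following is a sketch of that definition using str.casefold and unicodedata; the function names are my own, and the fullwidth letter is just a convenient test character.

```python
import unicodedata


def compatibility_caseless_key(text: str) -> str:
    # NFKD(toCasefold(NFKD(toCasefold(NFD(X)))))
    step = unicodedata.normalize("NFD", text).casefold()
    step = unicodedata.normalize("NFKD", step).casefold()
    return unicodedata.normalize("NFKD", step)


def compatibility_caseless_match(x: str, y: str) -> bool:
    return compatibility_caseless_key(x) == compatibility_caseless_key(y)


fullwidth_a = "\uff21"  # FULLWIDTH LATIN CAPITAL LETTER A

# Case folding alone yields the fullwidth small letter, which is not "a".
print(fullwidth_a.casefold() == "a".casefold())        # False
# With the compatibility normalization steps the two match.
print(compatibility_caseless_match(fullwidth_a, "a"))  # True
```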
As has been noted in other answers, Python doesn't provide access to the Turkic tailorings when case folding.
There are two approaches: build the Turkic handling on top of str.casefold, or use PyICU.
I'll use PyICU, with the icu.Char and icu.CaseMap classes:
import icu

def toCasefold(text: str, full: bool = True, turkic: bool = False) -> str:
    # Enumerated constants to use with icu.CaseMap:
    #   icu.U_FOLD_CASE_DEFAULT : 0
    #   icu.U_FOLD_CASE_EXCLUDE_SPECIAL_I : 1
    #
    # Enumerated constants in icu.Char:
    #   icu.Char.FOLD_CASE_DEFAULT : 0
    #   icu.Char.FOLD_CASE_EXCLUDE_SPECIAL_I : 1
    option: int = 1 if turkic else 0
    if not full:
        # Simple case folding: fold one character at a time.
        return "".join([icu.Char.foldCase(char, option) for char in text])
    # Full case folding: fold the whole string.
    return icu.CaseMap.fold(option, text)
city = 'DİYARBAKIR'
# Default casefold of a string in Python
#
print(city.casefold())
# di̇yarbakir
# Full case folding in PyICU
# Could also use icu.UnicodeString.foldCase
#
print(toCasefold(city))
# di̇yarbakir
# Full case folding in PyICU, using Turkic rules in Casefolding.txt
# Could also use icu.UnicodeString.foldCase
#
print(toCasefold(city, turkic=True))
# diyarbakır
# Simple case folding in PyICU
#
print(toCasefold(city, full=False))
# dİyarbakir
# Simple case folding in PyICU, using Turkic rules in Casefolding.txt
# Could also use icu.UnicodeString.foldCase
#
print(toCasefold(city, full=False, turkic=True))
# diyarbakır
So DİYARBAKIR can be casefolded according to Unicode rules to di̇yarbakir, diyarbakır, or dİyarbakir, depending on the type of case folding and the options applied.
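If PyICU is not available, the full Turkic folding can be approximated in plain Python. This is a sketch that assumes the only Turkic-specific (status T) mappings are I → ı and İ → i, and that the input is NFC-normalized; turkic_casefold is a hypothetical helper name.

```python
def turkic_casefold(text: str) -> str:
    # Apply the two T-status mappings first, then let str.casefold
    # handle everything else with the default full case folding.
    text = text.replace("\u0049", "\u0131")  # I -> dotless i
    text = text.replace("\u0130", "\u0069")  # I with dot above -> i
    return text.casefold()


city = "D\u0130YARBAKIR"  # DİYARBAKIR

default_fold = city.casefold()
turkic_fold = turkic_casefold(city)

print(default_fold == "di\u0307yarbakir")  # True
print(turkic_fold == "diyarbak\u0131r")    # True
```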
Upvotes: 0
Reputation: 7795
1) In Python 3, casefold() should be used to implement caseless string matching.
Starting with Python 3.0, strings are stored as Unicode. Section 3.13 of the Unicode Standard defines default caseless matching as follows:
A string X is a caseless match for a string Y if and only if:
toCasefold(X) = toCasefold(Y)
Python's casefold() implements Unicode's toCasefold(), so it should be used to implement caseless string matching. However, case folding alone is not enough to cover some corner cases or to pass the Turkey Test (see Point 3).
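One such corner case is canonically equivalent text stored with different code points. A minimal sketch, using a character of my own choosing (Å as one code point versus a plus a combining ring):

```python
import unicodedata

composed = "\u00c5"     # LATIN CAPITAL LETTER A WITH RING ABOVE
decomposed = "a\u030a"  # "a" followed by COMBINING RING ABOVE

# Default caseless matching: compare the case-folded strings directly.
plain = composed.casefold() == decomposed.casefold()

# Normalizing the folded strings to NFD makes the two forms compare equal.
normalized = (
    unicodedata.normalize("NFD", composed.casefold())
    == unicodedata.normalize("NFD", decomposed.casefold())
)

print(plain)       # False
print(normalized)  # True
```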
2) As of Python 3.6, casefold() cannot pass the Turkey Test.
For two characters, uppercase I and dotted uppercase I, the Unicode Standard defines two different casefolding mappings.
The default (for non-Turkic languages):
I → i (U+0049 → U+0069)
İ → i̇ (U+0130 → U+0069 U+0307)
The alternative (for Turkic languages):
I → ı (U+0049 → U+0131)
İ → i (U+0130 → U+0069)
Python's casefold() can apply only the default mapping, so it fails the Turkey Test. For example, the Turkish words "LİMANI" and "limanı" are caseless equivalents, but "LİMANI".casefold() == "limanı".casefold() returns False. There is no option to enable the alternative mapping.
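To see why the comparison fails, it helps to print the code points that casefold() produces. The code_points helper below is just for illustration:

```python
def code_points(text: str) -> str:
    """Render a string as a space-separated list of U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


upper = "L\u0130MANI"  # LİMANI
lower = "liman\u0131"  # limanı

print(code_points("\u0049".casefold()))  # U+0069
print(code_points("\u0130".casefold()))  # U+0069 U+0307
print(code_points(upper.casefold()))     # U+006C U+0069 U+0307 U+006D U+0061 U+006E U+0069
print(code_points(lower.casefold()))     # U+006C U+0069 U+006D U+0061 U+006E U+0131
print(upper.casefold() == lower.casefold())  # False
```

The default mapping leaves a combining dot (U+0307) after the first i and turns the final I into a dotted i, while the dotless ı in the lowercase word is left unchanged.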
3) How to do caseless string matching in Python 3.
Section 3.13 of the Unicode Standard describes several caseless matching algorithms. Canonical caseless matching would probably suit most use cases, since it also handles canonically equivalent strings that are stored with different code points. We only need to add an option to switch between non-Turkic and Turkic case folding.
import unicodedata

def normalize_NFD(string):
    return unicodedata.normalize('NFD', string)

def casefold_(string, include_special_i=False):
    if include_special_i:
        string = unicodedata.normalize('NFC', string)
        string = string.replace('\u0049', '\u0131')
        string = string.replace('\u0130', '\u0069')
    return string.casefold()

def casefold_NFD(string, include_special_i=False):
    return normalize_NFD(casefold_(normalize_NFD(string), include_special_i))

def caseless_match(string1, string2, include_special_i=False):
    return casefold_NFD(string1, include_special_i) == casefold_NFD(string2, include_special_i)
casefold_() is a wrapper for Python's casefold(). If its parameter include_special_i is set to True, it applies the Turkic mapping; if it is set to False, the default mapping is used.
caseless_match() does canonical caseless matching for string1 and string2. If the strings are Turkic words, the include_special_i parameter must be set to True.
Examples:
>>> caseless_match('LİMANI', 'limanı', include_special_i=True)
True
>>> caseless_match('LİMANI', 'limanı')
False
>>> caseless_match('INTENSIVE', 'intensive', include_special_i=True)
False
>>> caseless_match('INTENSIVE', 'intensive')
True
Upvotes: 19