FlyingLightning

Reputation: 165

Should I use Python casefold?

I have recently been reading about casefold and string comparisons that ignore case. I've read that the MSDN standard is to use InvariantCulture and definitely avoid toLowercase. However, casefold, from what I have read, is like a more aggressive toLowercase. My question is: should I use casefold in Python, or is there a more Pythonic standard to use instead? Also, does casefold pass the Turkey Test?

Upvotes: 7

Views: 5444

Answers (2)

Andj

Reputation: 1364

An old question, but I'll add an answer for future reference, and flesh out the discussion of case folding and insensitive matching in Python and Unicode.

Unicode defines two sets of operations. The first is case mapping, for which Unicode defines three cases: lowercase, uppercase, and titlecase. Case mappings are string transformations that change text from one case to another.

The second is case folding. Case folding is designed to remove case distinctions before comparing or matching strings. This is different from the comparison operations between two strings defined in collation.

If you need to match or compare strings case insensitively, use case folding.

If you want to transform text, or to do case-sensitive string comparison, use case mapping operations.
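
To illustrate the distinction, here is a minimal sketch using only built-in str methods. The sample words are my own choice for illustration.

```python
# Case mapping transforms text; case folding prepares it for comparison.
a = "Straße"
b = "STRASSE"

# Lowercasing keeps the sharp s (ß), so the two spellings still differ.
lower_equal = a.lower() == b.lower()

# Case folding expands ß to "ss", so the two spellings compare equal.
folded_equal = a.casefold() == b.casefold()

print(lower_equal, folded_equal)
```

Note that the folded form is only meant as a comparison key; it is not something you would normally display to a user.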

The key source of data on case folding is the CaseFolding.txt file in the UCD.

Three methods are defined for case folding:

  1. Simple casefolding. This is used when you want to minimise the size of the data you need to work with. It can be found in embedded systems, and is used in some regex engines. It involves folding single characters to single characters. Simple casefolding uses the mappings with status C and S.
  2. Full casefolding. This is what str.casefold uses. Individual characters could be mapped to a sequence of characters. Full casefolding uses the mappings with status C and F.
  3. Turkic tailoring for Turkish, Azerbaijani, Uzbek, Tatar and Kazakh. This is an optional folding that by default isn't used, but is an available option to casefolding in Unicode.

Casefolding is not a string operation that is sensitive to locales or languages, except for the option to use Turkic exceptions. It is also important to note that case insensitivity using str.casefold differs from case insensitivity in the re module.
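
The difference from the re module can be seen in a small sketch. It assumes the standard library re module with the re.IGNORECASE flag, which matches one character at a time and so cannot expand a single character into two. The sample strings are my own.

```python
import re

pattern_text = "straße"
candidate = "STRASSE"

# str.casefold applies full case folding, so ß expands to "ss"
# and both strings fold to "strasse".
casefold_match = pattern_text.casefold() == candidate.casefold()

# With re.IGNORECASE the six-character pattern cannot cover the
# seven-character candidate, because ß is never expanded to "ss".
regex_match = re.fullmatch(pattern_text, candidate, re.IGNORECASE) is not None

print(casefold_match, regex_match)
```

So if you need full case folding in combination with regular expressions, one option is to casefold both the pattern text and the input before matching, bearing in mind that this changes string lengths and offsets.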

Casefolding is a building block to other matching algorithms, including canonical caseless matching, compatibility caseless matching, and identifier matching.
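
As a rough sketch of how casefolding combines with normalization in two of these algorithms, the functions below follow the definitions of canonical and compatibility caseless matching in the Unicode Standard. The function names are hypothetical, and the fullwidth sample string is my own choice.

```python
import unicodedata

def canonical_caseless_key(text: str) -> str:
    # Hypothetical helper: NFD(toCasefold(NFD(X)))
    nfd = unicodedata.normalize("NFD", text)
    return unicodedata.normalize("NFD", nfd.casefold())

def compatibility_caseless_key(text: str) -> str:
    # Hypothetical helper: NFKD(toCasefold(NFKD(toCasefold(NFD(X)))))
    step = unicodedata.normalize("NFD", text).casefold()
    step = unicodedata.normalize("NFKD", step).casefold()
    return unicodedata.normalize("NFKD", step)

fullwidth = "\uff21\uff22\uff23"  # fullwidth Latin capital letters A, B, C

# Canonical matching keeps the fullwidth letters distinct from ASCII.
canonical_equal = canonical_caseless_key(fullwidth) == canonical_caseless_key("abc")

# Compatibility matching also removes the fullwidth distinction.
compat_equal = compatibility_caseless_key(fullwidth) == compatibility_caseless_key("abc")

print(canonical_equal, compat_equal)
```

Two strings match under either scheme when their keys are equal. Which scheme is appropriate depends on whether compatibility variants such as fullwidth forms should count as the same text.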

As has been noted in other answers, Python doesn't provide access to the Turkic tailorings when casefolding.

There are two approaches:

  1. Build a custom function or class to handle casefolding while using str.casefold, or
  2. Make use of PyICU, a wrapper around icu4c

I'll use PyICU, with its icu.Char and icu.CaseMap classes:

import icu

def toCasefold(text: str, full: bool = True, turkic: bool = False) -> str:
    # Enumerated constants to use with icu.CaseMap:
    # icu.U_FOLD_CASE_DEFAULT : 0
    # icu.U_FOLD_CASE_EXCLUDE_SPECIAL_I : 1
    #
    # Enumerated constants in icu.Char:
    # icu.Char.FOLD_CASE_DEFAULT : 0
    # icu.Char.FOLD_CASE_EXCLUDE_SPECIAL_I : 1

    option: int = 1 if turkic else 0
    if not full:
        # Simple case folding: fold each character on its own.
        return "".join([icu.Char.foldCase(char, option) for char in text])
    # Full case folding: a character may fold to several characters.
    return icu.CaseMap.fold(option, text)

city = 'DİYARBAKIR'

# Default casefold of a string in Python
#
print(city.casefold())
# di̇yarbakir

# Full case folding in PyICU
# Could also use icu.UnicodeString.foldCase
#
print(toCasefold(city))
# di̇yarbakir

# Full case folding in PyICU, using Turkic rules in CaseFolding.txt
# Could also use icu.UnicodeString.foldCase
#
print(toCasefold(city, turkic=True))
# diyarbakır

# Simple case folding in PyICU
#
print(toCasefold(city, full=False))
# dİyarbakir

# Simple case folding in PyICU, using Turkic rules in CaseFolding.txt
# Could also use icu.UnicodeString.foldCase
#
print(toCasefold(city, full=False, turkic=True))
# diyarbakır

So DİYARBAKIR can be casefolded according to Unicode rules to di̇yarbakir, diyarbakır, or dİyarbakir, depending on the type of case folding and the options applied.

Upvotes: 0

SergiyKolesnikov

Reputation: 7795

1) In Python 3, casefold() should be used to implement caseless string matching.

Starting with Python 3.0, strings are stored as Unicode. Section 3.13 of the Unicode Standard defines default caseless matching as follows:

A string X is a caseless match for a string Y if and only if:
toCasefold(X) = toCasefold(Y)
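
In Python this definition translates almost directly. The sketch below is only an illustration: the function name default_caseless_match is hypothetical and the sample strings are my own.

```python
def default_caseless_match(x: str, y: str) -> bool:
    # Hypothetical helper: toCasefold(X) = toCasefold(Y)
    return x.casefold() == y.casefold()

# Both strings use precomposed é / É, so folding alone is enough.
precomposed = default_caseless_match("r\u00e9sum\u00e9", "R\u00c9SUM\u00c9")

# Here the second string spells É as E + U+0301 (combining acute accent).
# The strings look the same but fold to different code point sequences.
decomposed = default_caseless_match("r\u00e9sum\u00e9", "RE\u0301SUME\u0301")

print(precomposed, decomposed)
```

The second comparison shows one of the corner cases that casefolding alone does not handle: canonically equivalent strings need normalization as well.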

Python's casefold() implements Unicode's toCasefold(), so it should be used to implement caseless string matching. However, casefolding alone is not enough to cover some corner cases or to pass the Turkey Test (see Point 3).

2) As of Python 3.6, casefold() cannot pass the Turkey Test.

For two characters, uppercase I and dotted uppercase I, the Unicode Standard defines two different casefolding mappings.

The default (for non-Turkic languages):
I → i (U+0049 → U+0069)
İ → i̇ (U+0130 → U+0069 U+0307)

The alternative (for Turkic languages):
I → ı (U+0049 → U+0131)
İ → i (U+0130 → U+0069)

Python's casefold() can apply only the default mapping and fails the Turkey Test. For example, the Turkish words "LİMANI" and "limanı" are caseless equivalents, but "LİMANI".casefold() == "limanı".casefold() returns False. There is no option to enable the alternative mapping.
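
To see the default mapping at work, the following sketch prints the code points that casefold() produces. The helper code_points is a hypothetical name introduced only for this illustration.

```python
def code_points(text: str) -> list[str]:
    # Hypothetical helper: show each character as U+XXXX.
    return [f"U+{ord(ch):04X}" for ch in text]

# Default mapping: I folds to i, and İ folds to i + combining dot above.
folded_i = code_points("I".casefold())
folded_dotted_i = code_points("\u0130".casefold())

# "LİMANI" and "limanı", written with escapes to avoid encoding ambiguity.
upper_word = "L\u0130MANI"
lower_word = "liman\u0131"
words_match = upper_word.casefold() == lower_word.casefold()

print(folded_i, folded_dotted_i, words_match)
```

The folded uppercase word ends in a plain i and contains an extra U+0307, while the lowercase word keeps its dotless ı, so the two can never compare equal under the default mapping.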

3) How to do caseless string matching in Python 3.

Section 3.13 of the Unicode Standard describes several caseless matching algorithms. Canonical caseless matching would probably suit most use cases. This algorithm already takes into account all corner cases. We only need to add an option to switch between non-Turkic and Turkic casefolding.

import unicodedata

def normalize_NFD(string):
    return unicodedata.normalize('NFD', string)

def casefold_(string, include_special_i=False):
    if include_special_i:
        # Recompose first so that I + U+0307 becomes a single İ (U+0130).
        string = unicodedata.normalize('NFC', string)
        # Turkic mapping: I (U+0049) → ı (U+0131), İ (U+0130) → i (U+0069).
        string = string.replace('\u0049', '\u0131')
        string = string.replace('\u0130', '\u0069')
    return string.casefold()

def casefold_NFD(string, include_special_i=False):
    # Canonical caseless form: NFD(toCasefold(NFD(string))).
    return normalize_NFD(casefold_(normalize_NFD(string), include_special_i))

def caseless_match(string1, string2, include_special_i=False):
    return casefold_NFD(string1, include_special_i) == casefold_NFD(string2, include_special_i)

casefold_() is a wrapper for Python's casefold(). If its parameter include_special_i is set to True, then it applies the Turkic mapping, and if it is set to False the default mapping is used.

caseless_match() does canonical caseless matching for string1 and string2. If the strings are Turkic words, the include_special_i parameter must be set to True.

Examples:

>>> caseless_match('LİMANI', 'limanı', include_special_i=True)
True
>>> caseless_match('LİMANI', 'limanı')
False
>>> caseless_match('INTENSIVE', 'intensive', include_special_i=True)
False
>>> caseless_match('INTENSIVE', 'intensive')
True

Upvotes: 19
