Nik
Nik

Reputation: 61

How do I match all unicode lowercase characters in Python with a regular expression?

I am trying to write a regular expression that would match Unicode lowercase characters in Python 3. I'm using the re library. For example, re.findall(some_pattern, 'u∏ñKθ') should return ['u', 'ñ', 'θ'].

In Sublime Text, I could simply type [[:lower:]] to find these characters.

I'm aware that Python can match on any Unicode character with re.compile('[^\W\d_]'), but I specifically need to differentiate between uppercase and lowercase. I'm also aware that re.compile('[a-z]') would match any ASCII lowercase character, but my data is UTF-8, and it includes lots of non-ASCII characters—I checked.

Is this possible with regular expressions in Python 3, or will I need to take an alternative approach? I know other ways to do it. I was just hoping to use regex.

Upvotes: 3

Views: 1474

Answers (5)

Al Sweigart
Al Sweigart

Reputation: 12939

I created the following Python script to get all of the lowercase letters according to the islower() string method, and find the contiguous ranges to create a large character class regex that identifies all lowercase letters.

import itertools, re, sys

lowercaseOrdinals = [i for i in range(sys.maxunicode) if chr(i).islower()]

def ranges(i):
    for a, b in itertools.groupby(enumerate(i), lambda pair: pair[1] - pair[0]):
        b = list(b)
        yield b[0][1], b[-1][1]

lowercaseRegex = '['
for startRangeOrdinal, endRangeOrdinal in ranges(lowercaseOrdinals):
    if startRangeOrdinal == endRangeOrdinal:
        lowercaseRegex += chr(startRangeOrdinal)
    else:
        lowercaseRegex += chr(startRangeOrdinal) + '-' + chr(endRangeOrdinal)    lowercaseRegex += ']'

with open('lower.txt', 'w', encoding='utf-8') as fileObj:
    fileObj.write(lowercaseRegex)

The 872 character-long character class regex it produces that matches any lowercase unicode character is here:

[a-zªµºß-öø-ÿāăąćĉċčďđēĕėęěĝğġģĥħĩīĭįıijĵķ-ĸĺļľŀłńņň-ʼnŋōŏőœŕŗřśŝşšţťŧũūŭůűųŵŷźżž-ƀƃƅƈƌ-ƍƒƕƙ-ƛƞơƣƥƨƪ-ƫƭưƴƶƹ-ƺƽ-ƿdžljnjǎǐǒǔǖǘǚǜ-ǝǟǡǣǥǧǩǫǭǯ-ǰdzǵǹǻǽǿȁȃȅȇȉȋȍȏȑȓȕȗșțȝȟȡȣȥȧȩȫȭȯȱȳ-ȹȼȿ-ɀɂɇɉɋɍɏ-ʓʕ-ʸˀ-ˁˠ-ˤͅͱͳͷͺ-ͽΐά-ώϐ-ϑϕ-ϗϙϛϝϟϡϣϥϧϩϫϭϯ-ϳϵϸϻ-ϼа-џѡѣѥѧѩѫѭѯѱѳѵѷѹѻѽѿҁҋҍҏґғҕҗҙқҝҟҡңҥҧҩҫҭүұҳҵҷҹһҽҿӂӄӆӈӊӌӎ-ӏӑӓӕӗәӛӝӟӡӣӥӧөӫӭӯӱӳӵӷӹӻӽӿԁԃԅԇԉԋԍԏԑԓԕԗԙԛԝԟԡԣԥԧԩԫԭԯՠ-ֈა-ჺჽ-ჿᏸ-ᏽᲀ-ᲈᴀ-ᶿḁḃḅḇḉḋḍḏḑḓḕḗḙḛḝḟḡḣḥḧḩḫḭḯḱḳḵḷḹḻḽḿṁṃṅṇṉṋṍṏṑṓṕṗṙṛṝṟṡṣṥṧṩṫṭṯṱṳṵṷṹṻṽṿẁẃẅẇẉẋẍẏẑẓẕ-ẝẟạảấầẩẫậắằẳẵặẹẻẽếềểễệỉịọỏốồổỗộớờởỡợụủứừửữựỳỵỷỹỻỽỿ-ἇἐ-ἕἠ-ἧἰ-ἷὀ-ὅὐ-ὗὠ-ὧὰ-ώᾀ-ᾇᾐ-ᾗᾠ-ᾧᾰ-ᾴᾶ-ᾷιῂ-ῄῆ-ῇῐ-ΐῖ-ῗῠ-ῧῲ-ῴῶ-ῷⁱⁿₐ-ₜℊℎ-ℏℓℯℴℹℼ-ℽⅆ-ⅉⅎⅰ-ⅿↄⓐ-ⓩⰰ-ⱞⱡⱥ-ⱦⱨⱪⱬⱱⱳ-ⱴⱶ-ⱽⲁⲃⲅⲇⲉⲋⲍⲏⲑⲓⲕⲗⲙⲛⲝⲟⲡⲣⲥⲧⲩⲫⲭⲯⲱⲳⲵⲷⲹⲻⲽⲿⳁⳃⳅⳇⳉⳋⳍⳏⳑⳓⳕⳗⳙⳛⳝⳟⳡⳣ-ⳤⳬⳮⳳⴀ-ⴥⴧⴭꙁꙃꙅꙇꙉꙋꙍꙏꙑꙓꙕꙗꙙꙛꙝꙟꙡꙣꙥꙧꙩꙫꙭꚁꚃꚅꚇꚉꚋꚍꚏꚑꚓꚕꚗꚙꚛ-ꚝꜣꜥꜧꜩꜫꜭꜯ-ꜱꜳꜵꜷꜹꜻꜽꜿꝁꝃꝅꝇꝉꝋꝍꝏꝑꝓꝕꝗꝙꝛꝝꝟꝡꝣꝥꝧꝩꝫꝭꝯ-ꝸꝺꝼꝿꞁꞃꞅꞇꞌꞎꞑꞓ-ꞕꞗꞙꞛꞝꞟꞡꞣꞥꞧꞩꞯꞵꞷꞹꞻꞽꞿꟃꟈꟊꟶꟸ-ꟺꬰ-ꭚꭜ-ꭨꭰ-ꮿff-stﬓ-ﬗa-z𐐨-𐑏𐓘-𐓻𐳀-𐳲𑣀-𑣟𖹠-𖹿𝐚-𝐳𝑎-𝑔𝑖-𝑧𝒂-𝒛𝒶-𝒹𝒻𝒽-𝓃𝓅-𝓏𝓪-𝔃𝔞-𝔷𝕒-𝕫𝖆-𝖟𝖺-𝗓𝗮-𝘇𝘢-𝘻𝙖-𝙯𝚊-𝚥𝛂-𝛚𝛜-𝛡𝛼-𝜔𝜖-𝜛𝜶-𝝎𝝐-𝝕𝝰-𝞈𝞊-𝞏𝞪-𝟂𝟄-𝟉𝟋𞤢-𞥃]

Additionally, the 839 character-long uppercase version of this regex is here:

[A-ZÀ-ÖØ-ÞĀĂĄĆĈĊČĎĐĒĔĖĘĚĜĞĠĢĤĦĨĪĬĮİIJĴĶĹĻĽĿŁŃŅŇŊŌŎŐŒŔŖŘŚŜŞŠŢŤŦŨŪŬŮŰŲŴŶŸ-ŹŻŽƁ-ƂƄƆ-ƇƉ-ƋƎ-ƑƓ-ƔƖ-ƘƜ-ƝƟ-ƠƢƤƦ-ƧƩƬƮ-ƯƱ-ƳƵƷ-ƸƼDŽLJNJǍǏǑǓǕǗǙǛǞǠǢǤǦǨǪǬǮDZǴǶ-ǸǺǼǾȀȂȄȆȈȊȌȎȐȒȔȖȘȚȜȞȠȢȤȦȨȪȬȮȰȲȺ-ȻȽ-ȾɁɃ-ɆɈɊɌɎͰͲͶͿΆΈ-ΊΌΎ-ΏΑ-ΡΣ-ΫϏϒ-ϔϘϚϜϞϠϢϤϦϨϪϬϮϴϷϹ-ϺϽ-ЯѠѢѤѦѨѪѬѮѰѲѴѶѸѺѼѾҀҊҌҎҐҒҔҖҘҚҜҞҠҢҤҦҨҪҬҮҰҲҴҶҸҺҼҾӀ-ӁӃӅӇӉӋӍӐӒӔӖӘӚӜӞӠӢӤӦӨӪӬӮӰӲӴӶӸӺӼӾԀԂԄԆԈԊԌԎԐԒԔԖԘԚԜԞԠԢԤԦԨԪԬԮԱ-ՖႠ-ჅჇჍᎠ-ᏵᲐ-ᲺᲽ-ᲿḀḂḄḆḈḊḌḎḐḒḔḖḘḚḜḞḠḢḤḦḨḪḬḮḰḲḴḶḸḺḼḾṀṂṄṆṈṊṌṎṐṒṔṖṘṚṜṞṠṢṤṦṨṪṬṮṰṲṴṶṸṺṼṾẀẂẄẆẈẊẌẎẐẒẔẞẠẢẤẦẨẪẬẮẰẲẴẶẸẺẼẾỀỂỄỆỈỊỌỎỐỒỔỖỘỚỜỞỠỢỤỦỨỪỬỮỰỲỴỶỸỺỼỾἈ-ἏἘ-ἝἨ-ἯἸ-ἿὈ-ὍὙὛὝὟὨ-ὯᾸ-ΆῈ-ΉῘ-ΊῨ-ῬῸ-Ώℂℇℋ-ℍℐ-ℒℕℙ-ℝℤΩℨK-ℭℰ-ℳℾ-ℿⅅⅠ-ⅯↃⒶ-ⓏⰀ-ⰮⱠⱢ-ⱤⱧⱩⱫⱭ-ⱰⱲⱵⱾ-ⲀⲂⲄⲆⲈⲊⲌⲎⲐⲒⲔⲖⲘⲚⲜⲞⲠⲢⲤⲦⲨⲪⲬⲮⲰⲲⲴⲶⲸⲺⲼⲾⳀⳂⳄⳆⳈⳊⳌⳎⳐⳒⳔⳖⳘⳚⳜⳞⳠⳢⳫⳭⳲꙀꙂꙄꙆꙈꙊꙌꙎꙐꙒꙔꙖꙘꙚꙜꙞꙠꙢꙤꙦꙨꙪꙬꚀꚂꚄꚆꚈꚊꚌꚎꚐꚒꚔꚖꚘꚚꜢꜤꜦꜨꜪꜬꜮꜲꜴꜶꜸꜺꜼꜾꝀꝂꝄꝆꝈꝊꝌꝎꝐꝒꝔꝖꝘꝚꝜꝞꝠꝢꝤꝦꝨꝪꝬꝮꝹꝻꝽ-ꝾꞀꞂꞄꞆꞋꞍꞐꞒꞖꞘꞚꞜꞞꞠꞢꞤꞦꞨꞪ-ꞮꞰ-ꞴꞶꞸꞺꞼꞾꟂꟄ-ꟇꟉꟵA-Z𐐀-𐐧𐒰-𐓓𐲀-𐲲𑢠-𑢿𖹀-𖹟𝐀-𝐙𝐴-𝑍𝑨-𝒁𝒜𝒞-𝒟𝒢𝒥-𝒦𝒩-𝒬𝒮-𝒵𝓐-𝓩𝔄-𝔅𝔇-𝔊𝔍-𝔔𝔖-𝔜𝔸-𝔹𝔻-𝔾𝕀-𝕄𝕆𝕊-𝕐𝕬-𝖅𝖠-𝖹𝗔-𝗭𝘈-𝘡𝘼-𝙕𝙰-𝚉𝚨-𝛀𝛢-𝛺𝜜-𝜴𝝖-𝝮𝞐-𝞨𝟊𞤀-𞤡🄰-🅉🅐-🅩🅰-🆉]

Upvotes: 0

Ryszard Czech
Ryszard Czech

Reputation: 18621

You do not need to use PyPi regex library (although I'd urge to use it here since it supports POSIX character classes (like [:lower:]) and Unicode property classes like \p{Ll}):

import re, sys
lower = '[{}]'.format("".join([chr(i) for i in range(sys.maxunicode) if chr(i).islower()]))
print(re.findall(lower, 'ABCabcÆæóżźę'))
# => ['a', 'b', 'c', 'æ', 'ó', 'ż', 'ź', 'ę']

See Python proof.

Results:

['a', 'b', 'c', 'æ', 'ó', 'ż', 'ź', 'ę']

Upvotes: 1

dawg
dawg

Reputation: 103884

You can use the regex module that supports POSIX character classes:

import regex 

>>> regex.findall('[[:lower:]]', 'u∏ñKθ')
['u', 'ñ', 'θ']

Or, use the Unicode Category Class of \p{Ll} or \p{Lowercase_Letter}:

>>> regex.findall(r'\p{Ll}', 'u∏ñKθ')
['u', 'ñ', 'θ']

Or just use Python's string logic:

>>> [c for c in 'u∏ñKθ' if c.islower()]
['u', 'ñ', 'θ']

In either case, beware of string such as this:

>>> s2='\u0061\u0300\u00E0'
>>> s2
'àà'

The first grapheme 'à' is the result of an 'a' with the combining character of '̀' where the second 'à' is the result of that specific code point. If you use a character class here, it will match 'a' and not the combining accent:

>>> regex.findall('[[:lower:]]', s2)
['a', 'à']
>>> [c for c in s2 if c.islower()]
['a', 'à']

To solve that, you need to account for that in more complicated regex patterns or normalize the string:

>>> regex.findall('[[:lower:]]', unicodedata.normalize('NFC',s2))
['à', 'à']

or loop through grapheme by grapheme:

>>> [c for c in regex.findall(r'\X', s2) if c.islower()]
['à', 'à']

Upvotes: 6

snakecharmerb
snakecharmerb

Reputation: 55699

You can use the regex package if using a third party package is acceptable.

>>> import regex
>>> s = 'ABCabcÆæ'
>>> m = regex.findall(r'[[:lower:]]', s)
>>> m
['a', 'b', 'c', 'æ']

Upvotes: 3

PCM
PCM

Reputation: 3011

Try this -

import re

a = re.findall('[^\W\d_]', 'u∏ñKθ')

for i in a:
    if i.islower():
        print(i)

re.findall finds all the UTF-8 characters, then filter it out with .islower() to check if they are lower case characters and then print it

Upvotes: -1

Related Questions