Reputation: 61
I am trying to write a regular expression that would match Unicode lowercase characters in Python 3. I'm using the re
library. For example, re.findall(some_pattern, 'u∏ñKθ')
should return ['u', 'ñ', 'θ']
.
In Sublime Text, I could simply type [[:lower:]]
to find these characters.
I'm aware that Python can match on any Unicode character with re.compile('[^\W\d_]')
, but I specifically need to differentiate between uppercase and lowercase. I'm also aware that re.compile('[a-z]')
would match any ASCII lowercase character, but my data is UTF-8, and it includes lots of non-ASCII characters—I checked.
Is this possible with regular expressions in Python 3, or will I need to take an alternative approach? I know other ways to do it. I was just hoping to use regex.
Upvotes: 3
Views: 1474
Reputation: 12939
I created the following Python script to get all of the lowercase letters according to the islower()
string method, and find the contiguous ranges to create a large character class regex that identifies all lowercase letters.
import itertools, re, sys
lowercaseOrdinals = [i for i in range(sys.maxunicode) if chr(i).islower()]
def ranges(i):
for a, b in itertools.groupby(enumerate(i), lambda pair: pair[1] - pair[0]):
b = list(b)
yield b[0][1], b[-1][1]
lowercaseRegex = '['
for startRangeOrdinal, endRangeOrdinal in ranges(lowercaseOrdinals):
if startRangeOrdinal == endRangeOrdinal:
lowercaseRegex += chr(startRangeOrdinal)
else:
lowercaseRegex += chr(startRangeOrdinal) + '-' + chr(endRangeOrdinal) lowercaseRegex += ']'
with open('lower.txt', 'w', encoding='utf-8') as fileObj:
fileObj.write(lowercaseRegex)
The 872 character-long character class regex it produces that matches any lowercase unicode character is here:
[a-zªµºß-öø-ÿāăąćĉċčďđēĕėęěĝğġģĥħĩīĭįıijĵķ-ĸĺļľŀłńņň-ʼnŋōŏőœŕŗřśŝşšţťŧũūŭůűųŵŷźżž-ƀƃƅƈƌ-ƍƒƕƙ-ƛƞơƣƥƨƪ-ƫƭưƴƶƹ-ƺƽ-ƿdžljnjǎǐǒǔǖǘǚǜ-ǝǟǡǣǥǧǩǫǭǯ-ǰdzǵǹǻǽǿȁȃȅȇȉȋȍȏȑȓȕȗșțȝȟȡȣȥȧȩȫȭȯȱȳ-ȹȼȿ-ɀɂɇɉɋɍɏ-ʓʕ-ʸˀ-ˁˠ-ˤͅͱͳͷͺ-ͽΐά-ώϐ-ϑϕ-ϗϙϛϝϟϡϣϥϧϩϫϭϯ-ϳϵϸϻ-ϼа-џѡѣѥѧѩѫѭѯѱѳѵѷѹѻѽѿҁҋҍҏґғҕҗҙқҝҟҡңҥҧҩҫҭүұҳҵҷҹһҽҿӂӄӆӈӊӌӎ-ӏӑӓӕӗәӛӝӟӡӣӥӧөӫӭӯӱӳӵӷӹӻӽӿԁԃԅԇԉԋԍԏԑԓԕԗԙԛԝԟԡԣԥԧԩԫԭԯՠ-ֈა-ჺჽ-ჿᏸ-ᏽᲀ-ᲈᴀ-ᶿḁḃḅḇḉḋḍḏḑḓḕḗḙḛḝḟḡḣḥḧḩḫḭḯḱḳḵḷḹḻḽḿṁṃṅṇṉṋṍṏṑṓṕṗṙṛṝṟṡṣṥṧṩṫṭṯṱṳṵṷṹṻṽṿẁẃẅẇẉẋẍẏẑẓẕ-ẝẟạảấầẩẫậắằẳẵặẹẻẽếềểễệỉịọỏốồổỗộớờởỡợụủứừửữựỳỵỷỹỻỽỿ-ἇἐ-ἕἠ-ἧἰ-ἷὀ-ὅὐ-ὗὠ-ὧὰ-ώᾀ-ᾇᾐ-ᾗᾠ-ᾧᾰ-ᾴᾶ-ᾷιῂ-ῄῆ-ῇῐ-ΐῖ-ῗῠ-ῧῲ-ῴῶ-ῷⁱⁿₐ-ₜℊℎ-ℏℓℯℴℹℼ-ℽⅆ-ⅉⅎⅰ-ⅿↄⓐ-ⓩⰰ-ⱞⱡⱥ-ⱦⱨⱪⱬⱱⱳ-ⱴⱶ-ⱽⲁⲃⲅⲇⲉⲋⲍⲏⲑⲓⲕⲗⲙⲛⲝⲟⲡⲣⲥⲧⲩⲫⲭⲯⲱⲳⲵⲷⲹⲻⲽⲿⳁⳃⳅⳇⳉⳋⳍⳏⳑⳓⳕⳗⳙⳛⳝⳟⳡⳣ-ⳤⳬⳮⳳⴀ-ⴥⴧⴭꙁꙃꙅꙇꙉꙋꙍꙏꙑꙓꙕꙗꙙꙛꙝꙟꙡꙣꙥꙧꙩꙫꙭꚁꚃꚅꚇꚉꚋꚍꚏꚑꚓꚕꚗꚙꚛ-ꚝꜣꜥꜧꜩꜫꜭꜯ-ꜱꜳꜵꜷꜹꜻꜽꜿꝁꝃꝅꝇꝉꝋꝍꝏꝑꝓꝕꝗꝙꝛꝝꝟꝡꝣꝥꝧꝩꝫꝭꝯ-ꝸꝺꝼꝿꞁꞃꞅꞇꞌꞎꞑꞓ-ꞕꞗꞙꞛꞝꞟꞡꞣꞥꞧꞩꞯꞵꞷꞹꞻꞽꞿꟃꟈꟊꟶꟸ-ꟺꬰ-ꭚꭜ-ꭨꭰ-ꮿff-stﬓ-ﬗa-z𐐨-𐑏𐓘-𐓻𐳀-𐳲𑣀-𑣟𖹠-𖹿𝐚-𝐳𝑎-𝑔𝑖-𝑧𝒂-𝒛𝒶-𝒹𝒻𝒽-𝓃𝓅-𝓏𝓪-𝔃𝔞-𝔷𝕒-𝕫𝖆-𝖟𝖺-𝗓𝗮-𝘇𝘢-𝘻𝙖-𝙯𝚊-𝚥𝛂-𝛚𝛜-𝛡𝛼-𝜔𝜖-𝜛𝜶-𝝎𝝐-𝝕𝝰-𝞈𝞊-𝞏𝞪-𝟂𝟄-𝟉𝟋𞤢-𞥃]
Additionally, the 839 character-long uppercase version of this regex is here:
[A-ZÀ-ÖØ-ÞĀĂĄĆĈĊČĎĐĒĔĖĘĚĜĞĠĢĤĦĨĪĬĮİIJĴĶĹĻĽĿŁŃŅŇŊŌŎŐŒŔŖŘŚŜŞŠŢŤŦŨŪŬŮŰŲŴŶŸ-ŹŻŽƁ-ƂƄƆ-ƇƉ-ƋƎ-ƑƓ-ƔƖ-ƘƜ-ƝƟ-ƠƢƤƦ-ƧƩƬƮ-ƯƱ-ƳƵƷ-ƸƼDŽLJNJǍǏǑǓǕǗǙǛǞǠǢǤǦǨǪǬǮDZǴǶ-ǸǺǼǾȀȂȄȆȈȊȌȎȐȒȔȖȘȚȜȞȠȢȤȦȨȪȬȮȰȲȺ-ȻȽ-ȾɁɃ-ɆɈɊɌɎͰͲͶͿΆΈ-ΊΌΎ-ΏΑ-ΡΣ-ΫϏϒ-ϔϘϚϜϞϠϢϤϦϨϪϬϮϴϷϹ-ϺϽ-ЯѠѢѤѦѨѪѬѮѰѲѴѶѸѺѼѾҀҊҌҎҐҒҔҖҘҚҜҞҠҢҤҦҨҪҬҮҰҲҴҶҸҺҼҾӀ-ӁӃӅӇӉӋӍӐӒӔӖӘӚӜӞӠӢӤӦӨӪӬӮӰӲӴӶӸӺӼӾԀԂԄԆԈԊԌԎԐԒԔԖԘԚԜԞԠԢԤԦԨԪԬԮԱ-ՖႠ-ჅჇჍᎠ-ᏵᲐ-ᲺᲽ-ᲿḀḂḄḆḈḊḌḎḐḒḔḖḘḚḜḞḠḢḤḦḨḪḬḮḰḲḴḶḸḺḼḾṀṂṄṆṈṊṌṎṐṒṔṖṘṚṜṞṠṢṤṦṨṪṬṮṰṲṴṶṸṺṼṾẀẂẄẆẈẊẌẎẐẒẔẞẠẢẤẦẨẪẬẮẰẲẴẶẸẺẼẾỀỂỄỆỈỊỌỎỐỒỔỖỘỚỜỞỠỢỤỦỨỪỬỮỰỲỴỶỸỺỼỾἈ-ἏἘ-ἝἨ-ἯἸ-ἿὈ-ὍὙὛὝὟὨ-ὯᾸ-ΆῈ-ΉῘ-ΊῨ-ῬῸ-Ώℂℇℋ-ℍℐ-ℒℕℙ-ℝℤΩℨK-ℭℰ-ℳℾ-ℿⅅⅠ-ⅯↃⒶ-ⓏⰀ-ⰮⱠⱢ-ⱤⱧⱩⱫⱭ-ⱰⱲⱵⱾ-ⲀⲂⲄⲆⲈⲊⲌⲎⲐⲒⲔⲖⲘⲚⲜⲞⲠⲢⲤⲦⲨⲪⲬⲮⲰⲲⲴⲶⲸⲺⲼⲾⳀⳂⳄⳆⳈⳊⳌⳎⳐⳒⳔⳖⳘⳚⳜⳞⳠⳢⳫⳭⳲꙀꙂꙄꙆꙈꙊꙌꙎꙐꙒꙔꙖꙘꙚꙜꙞꙠꙢꙤꙦꙨꙪꙬꚀꚂꚄꚆꚈꚊꚌꚎꚐꚒꚔꚖꚘꚚꜢꜤꜦꜨꜪꜬꜮꜲꜴꜶꜸꜺꜼꜾꝀꝂꝄꝆꝈꝊꝌꝎꝐꝒꝔꝖꝘꝚꝜꝞꝠꝢꝤꝦꝨꝪꝬꝮꝹꝻꝽ-ꝾꞀꞂꞄꞆꞋꞍꞐꞒꞖꞘꞚꞜꞞꞠꞢꞤꞦꞨꞪ-ꞮꞰ-ꞴꞶꞸꞺꞼꞾꟂꟄ-ꟇꟉꟵA-Z𐐀-𐐧𐒰-𐓓𐲀-𐲲𑢠-𑢿𖹀-𖹟𝐀-𝐙𝐴-𝑍𝑨-𝒁𝒜𝒞-𝒟𝒢𝒥-𝒦𝒩-𝒬𝒮-𝒵𝓐-𝓩𝔄-𝔅𝔇-𝔊𝔍-𝔔𝔖-𝔜𝔸-𝔹𝔻-𝔾𝕀-𝕄𝕆𝕊-𝕐𝕬-𝖅𝖠-𝖹𝗔-𝗭𝘈-𝘡𝘼-𝙕𝙰-𝚉𝚨-𝛀𝛢-𝛺𝜜-𝜴𝝖-𝝮𝞐-𝞨𝟊𞤀-𞤡🄰-🅉🅐-🅩🅰-🆉]
Upvotes: 0
Reputation: 18621
You do not need to use PyPi regex
library (although I'd urge to use it here since it supports POSIX character classes (like [:lower:]
) and Unicode property classes like \p{Ll}
):
import re, sys
lower = '[{}]'.format("".join([chr(i) for i in range(sys.maxunicode) if chr(i).islower()]))
print(re.findall(lower, 'ABCabcÆæóżźę'))
# => ['a', 'b', 'c', 'æ', 'ó', 'ż', 'ź', 'ę']
See Python proof.
Results:
['a', 'b', 'c', 'æ', 'ó', 'ż', 'ź', 'ę']
Upvotes: 1
Reputation: 103884
You can use the regex module that supports POSIX character classes:
import regex
>>> regex.findall('[[:lower:]]', 'u∏ñKθ')
['u', 'ñ', 'θ']
Or, use the Unicode Category Class of \p{Ll}
or \p{Lowercase_Letter}
:
>>> regex.findall(r'\p{Ll}', 'u∏ñKθ')
['u', 'ñ', 'θ']
Or just use Python's string logic:
>>> [c for c in 'u∏ñKθ' if c.islower()]
['u', 'ñ', 'θ']
In either case, beware of string such as this:
>>> s2='\u0061\u0300\u00E0'
>>> s2
'àà'
The first grapheme 'à'
is the result of an 'a'
with the combining character of '̀'
where the second 'à'
is the result of that specific code point. If you use a character class here, it will match 'a'
and not the combining accent:
>>> regex.findall('[[:lower:]]', s2)
['a', 'à']
>>> [c for c in s2 if c.islower()]
['a', 'à']
To solve that, you need to account for that in more complicated regex patterns or normalize the string:
>>> regex.findall('[[:lower:]]', unicodedata.normalize('NFC',s2))
['à', 'à']
or loop through grapheme by grapheme:
>>> [c for c in regex.findall(r'\X', s2) if c.islower()]
['à', 'à']
Upvotes: 6
Reputation: 55699
You can use the regex package if using a third party package is acceptable.
>>> import regex
>>> s = 'ABCabcÆæ'
>>> m = regex.findall(r'[[:lower:]]', s)
>>> m
['a', 'b', 'c', 'æ']
Upvotes: 3
Reputation: 3011
Try this -
import re
a = re.findall('[^\W\d_]', 'u∏ñKθ')
for i in a:
if i.islower():
print(i)
re.findall
finds all the UTF-8 characters, then filter it out with .islower()
to check if they are lower case characters and then print it
Upvotes: -1