Reputation: 13123
How can I match a letter from any language using a regex in Python 3?
re.match('[a-zA-Z]', text)
will match the English letters, but I want all languages to be supported simultaneously.
I don't wish to match the ' in can't, or underscores, or any other type of formatting. I do wish my regex to match: c, a, n, t, Å, é, and 中.
Upvotes: 21
Views: 12542
Reputation: 3391
As noted by others, it would be very difficult to maintain an up-to-date database of all letters in all existing languages. But in most cases you don't actually need that, and it can be perfectly fine for your code to begin by supporting just a few chosen languages and adding others as needed.
The following simple code supports matching for the Czech, German and Polish languages. The character sets can be easily obtained from Wikipedia.
import re
LANGS = [
'ÁáČčĎďÉéĚěÍíŇňÓóŘřŠšŤťÚúŮůÝýŽž', # Czech
'ÄäÖöÜüẞß', # German
'ĄąĆćĘęŁłŃńÓóŚśŹźŻż', # Polish
]
pattern = '[A-Za-z{langs}]'.format(langs=''.join(LANGS))
pattern = re.compile(pattern)
result = pattern.findall('Žluťoučký kůň')
print(result)
# ['Ž', 'l', 'u', 'ť', 'o', 'u', 'č', 'k', 'ý', 'k', 'ů', 'ň']
Upvotes: 0
Reputation: 80384
For Unicode regex work in Python, I very strongly recommend the following:
Use the regex library instead of the standard re, which is not really suitable for Unicode regular expressions.
If you are calling .encode and such by hand, you’re almost certainly doing something wrong.
Once you do this, you can safely write patterns that include \w or \p{script=Latin} or \p{alpha} and \p{lower} etc. and know that these will all do what the Unicode Standard says they should. I explain all of this Python Unicode regex business in much more detail in this answer. The short story is to always use regex, not re.
For general Unicode advice, I also have several talks from last OSCON about Unicode regular expressions. Only the third talk is specifically about Python, but much of the rest is adaptable.
Finally, there’s always this answer to put the fear of God (or at least, of Unicode) in your heart.
Upvotes: 24
Reputation: 13123
import re
text = "can't, Å, é, and 中ABC"
print(re.findall(r'\w+', text))
This works in Python 3, but it also matches underscores. The following, which uses the regex library, seems to do the job as I wish:
import regex
text = "can't, Å, é, and 中ABC _ sh_t"
print(regex.findall(r'\p{alpha}+', text))
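If installing a third-party package is not an option, a commonly used workaround with the standard re module is to subtract digits and underscores from \w. The following is only a sketch of that idea, not an exact equivalent of \p{alpha}: a few numeric characters that \d does not cover (superscript digits, for example) may still slip through. The name LETTERS is just an illustrative choice.

```python
import re

# [^\W\d_] reads as: a word character that is neither a digit nor an underscore.
LETTERS = re.compile(r'[^\W\d_]+')

text = "can't, Å, é, and 中ABC _ sh_t"
words = LETTERS.findall(text)
print(words)
# ['can', 't', 'Å', 'é', 'and', '中ABC', 'sh', 't']
```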
Upvotes: 1
Reputation: 20372
What's wrong with using the \w special sequence?
# -*- coding: utf-8 -*-
import re
test = u"can't, Å, é, and 中ABC"
# In Python 3, re.UNICODE is already the default for str patterns; it is harmless here.
print(re.findall(r'\w+', test, re.UNICODE))
Upvotes: 7
Reputation: 354476
You can match on \p{L}, which matches any Unicode code point that represents a letter of a script. That is, assuming you actually have a Unicode-capable regex engine, which I really hope Python would have.
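Note that Python's standard re module does not accept \p{L}; the third-party regex package does. To illustrate what the "L" general category covers without any extra dependency, here is a rough sketch using the standard unicodedata module. The helper name is_letter is made up for this example, and checking characters one by one is an assumption about how you might use it, not a replacement for a real regex engine.

```python
import unicodedata

def is_letter(ch):
    """Return True if ch belongs to a Unicode letter category (Lu, Ll, Lt, Lm, Lo)."""
    return unicodedata.category(ch).startswith('L')

text = "can't, Å, é, and 中_1"
letters = [ch for ch in text if is_letter(ch)]
print(letters)
# ['c', 'a', 'n', 't', 'Å', 'é', 'a', 'n', 'd', '中']
```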
Upvotes: 4
Reputation: 3323
Build a match class of all the characters you want to match. This might become very, very large. No, there is no RegEx shorthand for "All Kanji" ;)
Maybe it is easier to match for what you do not want, but even then, this class would become extremely large.
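If you do want to go down this road with only the standard library, the class does not have to be typed by hand. The sketch below generates it from the unicodedata module by collecting every code point whose general category starts with "L" and merging neighbours into ranges. The function names are hypothetical, the resulting class reflects whatever Unicode version your Python build ships with, and the pattern is indeed large, so treat this as an illustration rather than a recommendation.

```python
import re
import sys
import unicodedata

def letter_ranges():
    """Yield (start, end) code point ranges whose general category starts with 'L'."""
    start = None
    for cp in range(sys.maxunicode + 1):
        is_letter = unicodedata.category(chr(cp)).startswith('L')
        if is_letter and start is None:
            start = cp                      # a new run of letters begins
        elif not is_letter and start is not None:
            yield start, cp - 1             # the run ended at the previous code point
            start = None
    if start is not None:
        yield start, sys.maxunicode

def build_letter_class():
    """Build a regex character class such as [\\U00000041-\\U0000005a...]."""
    parts = []
    for lo, hi in letter_ranges():
        if lo == hi:
            parts.append('\\U%08x' % lo)
        else:
            parts.append('\\U%08x-\\U%08x' % (lo, hi))
    return '[' + ''.join(parts) + ']'

LETTER = re.compile(build_letter_class())
found = LETTER.findall("can't, Å, é, and 中 _ 1")
print(found)
# ['c', 'a', 'n', 't', 'Å', 'é', 'a', 'n', 'd', '中']
```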
Upvotes: 1