Reputation: 28534
I have the following regex (see it in action in PCRE):
.*?\P{L}*?(\p{L}+-?(\p{L}+)?)\P{L}*$
However, Python doesn't support Unicode regex with the \p{} syntax. To solve this, I read that I could use the regex module (not the default re), but this doesn't seem to work either, not even with the u flag.
Example:
sentence = "valt nog zoveel zal kunnen zeggen, "
print(re.sub(".*?\P{L}*?(\p{L}+-?(\p{L}+)?)\P{L}*$","\1",sentence))
Expected output: zeggen
This doesn't work with Python 3.4.3.
Upvotes: 1
Views: 2442
Reputation: 89629
As you can see, Unicode character classes like \p{L} are not available in the re module. However, that doesn't mean you can't do it with the re module, since \p{L} can be replaced with [^\W\d_] together with the UNICODE flag (even though there are small differences between these two character classes, see the link in comments).
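As a rough sketch of that replacement (not a definitive fix), the snippet below uses [^\W\d_] in place of \p{L} and [\W\d_] in place of \P{L}, then reruns the substitution from the question with the standard re module. The names LETTER and NON_LETTER are only illustrative. It also assumes raw strings are used, so that "\1" is read as a backreference rather than as the control character chr(1).

```python
import re

# Illustrative stand-ins for \p{L} and \P{L} in the standard re module.
LETTER = r"[^\W\d_]"      # word characters minus digits and underscore
NON_LETTER = r"[\W\d_]"   # everything that is not a letter

# Accented letters match; digits and underscores do not.
letters = re.findall(LETTER + "+", "lélijk 42 foo_bar", re.UNICODE)
print(letters)

# The substitution from the question, rewritten for re with raw strings.
sentence = "valt nog zoveel zal kunnen zeggen, "
pattern = r".*?{N}*?({L}+-?(?:{L}+)?){N}*$".format(L=LETTER, N=NON_LETTER)
last_word = re.sub(pattern, r"\1", sentence, flags=re.UNICODE)
print(last_word)
```

In Python 3 the UNICODE flag is already the default for str patterns, so passing it here is only for explicitness.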
Second point: your approach is not a good one (if I understand correctly, you are trying to extract the last word of each line), because you have chosen to remove everything that is not the last word (except the newline) with a replacement. ~52000 steps to extract 10 words in 10 lines of text is not acceptable (and will crash with more characters). A more efficient way is to find all the last words directly; see this example:
import re
s = '''Ik heb nog nooit een kat gezien zo lélijk!
Het is een minder lelijk dan uw hond.'''
p = re.compile(r'^.*\b(?<!-)(\w+(?:-\w+)*)', re.M | re.U)
words = p.findall(s)
print('\n'.join(words))
Notes:
To obtain the same result with Python 2.7, you only need to add a u before the opening quotes of the string: s = u'''...
If you absolutely want to limit results to letters avoiding digits and underscores, replace \w
with [^\W\d_]
in the pattern.
If you use the regex module, the character class \p{IsLatin} may be more appropriate for your use. Alternatively, whichever module you choose, you can use a more explicit class with only the needed characters, something like: [A-Za-záéóú...
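To illustrate the note about restricting results to letters, here is a small sketch comparing the \w pattern with the [^\W\d_] variant. The sample text and the variable names are made up for the example. The second line ends in a number so that the two patterns give different results.

```python
import re

# Hypothetical sample text: the second line ends with digits.
text = "Ik heb nog nooit een kat gezien zo lélijk!\nHet is een rood-wit huis, nummer 42."

# Pattern from the answer: the last "word" may contain digits and underscores.
word_pattern = re.compile(r'^.*\b(?<!-)(\w+(?:-\w+)*)', re.M | re.U)

# Letters-only variant: \w replaced with [^\W\d_].
letter_pattern = re.compile(
    r'^.*\b(?<!-)([^\W\d_]+(?:-[^\W\d_]+)*)', re.M | re.U)

last_words = word_pattern.findall(text)
last_letter_words = letter_pattern.findall(text)
print(last_words)
print(last_letter_words)
```

With \w the last match on the second line is the number, while the letters-only variant falls back to the last run of letters before it.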
You can achieve the same with the regex module with this pattern:
p = regex.compile(r'^.*\m(?<!-)(\pL+(?:-\pL+)*)', regex.M | regex.U)
Other ways:
By line with the re module:
p = re.compile(r'[^\w-]+', re.U)
for line in s.split('\n'):
    print(p.split(line+' ')[-2])
With the regex module you can take advantage of the reversed search:
p = regex.compile(r'(?r)\w+(?:-\w+)*\M', regex.U)
for line in s.split('\n'):
    print(p.search(line).group(0))
Upvotes: 3
Reputation: 2553
This post explains how to use Unicode properties in Python:
Python regex matching Unicode properties
Have you tried Ponyguruma, a Python binding to the Oniguruma regular expression engine? In that engine you can simply say
\p{Armenian} to match Armenian characters. \p{Ll} or \p{Zs} work too.
Upvotes: -1