Unicode Regex with regex not working in Python

Question

I have the following Regex (see it in action in PCRE)

.*?\P{L}*?(\p{L}+-?(\p{L}+)?)\P{L}*$

However, Python doesn't support Unicode regex with the \p{} syntax. To solve this, I read that I could use the regex module (not the default re), but this doesn't seem to work either, not even with the u flag.

Example:

sentence = "valt nog zoveel zal kunnen zeggen, "

print(re.sub(".*?\P{L}*?(\p{L}+-?(\p{L}+)?)\P{L}*$","\1",sentence))

Output: < blank >
Expected output: zeggen

This doesn't work with Python 3.4.3.

Casimir et Hippolyte · Accepted Answer

As you can see, Unicode character classes like \p{L} are not available in the re module. However, that doesn't mean you can't do it with the re module, since \p{L} can be replaced with [^\W\d_] with the UNICODE flag (there are small differences between these two character classes; see the link in the comments).
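To illustrate the substitution, here is a minimal sketch using only the re module. The name LETTERS and the second test string are made up for the example, and note that this class does not handle the optional hyphen from the question's pattern.

```python
import re

# [^\W\d_] means "a word character that is neither a digit nor an underscore",
# which is roughly a letter. In Python 3, str patterns are Unicode-aware by
# default; re.UNICODE is passed here only to make that explicit (Python 2 needs it).
LETTERS = re.compile(r'[^\W\d_]+', re.UNICODE)

sentence = "valt nog zoveel zal kunnen zeggen, "
words = LETTERS.findall(sentence)
print(words[-1])  # zeggen

# Accented letters are kept; digits and underscores are excluded.
print(LETTERS.findall("lélijk x_1 42"))  # ['lélijk', 'x']
```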

Second point: your approach is not a good one (if I understand correctly, you are trying to extract the last word of each line), because you have chosen to remove everything that is not the last word (except the newline) with a replacement. ~52000 steps to extract 10 words in 10 lines of text is not acceptable (and it will crash with more characters). A more efficient way is to find all the last words; see this example:

import re

s = '''Ik heb nog nooit een kat gezien zo lélijk!
Het is een minder lelijk dan uw hond.'''

p = re.compile(r'^.*\b(?
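The pattern above is cut off in this copy, so the following is only a sketch of the "find all the last words" idea, not necessarily the answer's original pattern. The pattern and the names LAST_WORD and last_words are assumptions made for illustration.

```python
import re

s = '''Ik heb nog nooit een kat gezien zo lélijk!
Het is een minder lelijk dan uw hond.'''

# Assumed pattern: a word (optionally hyphenated), followed only by
# non-word characters up to the end of the line. [^\w\n] keeps the
# trailing part from running over into the next line.
LAST_WORD = re.compile(r'(\w+(?:-\w+)*)[^\w\n]*$', re.M | re.U)

last_words = LAST_WORD.findall(s)
print('\n'.join(last_words))
# lélijk
# hond
```

Because findall collects the captured group directly, nothing has to be replaced or removed from the rest of the line.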



Notes:

To obtain the same result with Python 2.7, you only need to add a u before the opening quotes of the string: s = u'''...
If you absolutely want to limit results to letters, excluding digits and underscores, replace \w with [^\W\d_] in the pattern.
If you use the regex module, the character class \p{IsLatin} may be more appropriate for your use; or, whichever module you choose, use a more explicit class with only the needed characters, something like: [A-Za-záéóú...
You can achieve the same with the regex module with this pattern:

p = regex.compile(r'^.*\m(?



Other ways:

By line with the re module:

p = re.compile(r'[^\w-]+', re.U)
for line in s.split('\n'):
    print(p.split(line+' ')[-2])


With the regex module you can take advantage of the reversed search:

import regex

p = regex.compile(r'(?r)\w+(?:-\w+)*\M', regex.U)
for line in s.split('\n'):
    print(p.search(line).group(0))
