alvas

Reputation: 122112

How to include accented words in a regex

I have a utf-8 text with capitalized words within the text:

La cinta, que hoy se estrena en nuestro país, competirá contra Hors la
Loi, de Argelia, Dogtooth, de Grecia, Incendies, de Canadá, Life above
all , de Sudáfrica, y con la ganadora del Globo de Oro, In A Better
World, de Dinamarca.

The goal is to replace every word that starts with a capital letter with a placeholder (i.e. #NE#), except for the first word. The desired output looks like this:

La cinta, que hoy se estrena en nuestro país, competirá contra  #NE#
la  #NE# , de #NE# ,  #NE# , de  #NE# ,  #NE# , de  #NE#,  #NE# above
all , de #NE# , y con la ganadora del  #NE# de  #NE# ,  #NE# A #NE# #NE# , de  #NE# .

I've tried using regex as follows:

>>> import re
>>> def blind_CAPS_without_first_word(text):
...     first_word, _, the_rest = text.partition(' ')
...     blinded = re.sub('(?:[A-Z][\w]+\s*)', ' #NE# ', the_rest)
...     return " ".join([first_word, blinded])
... 
>>> text = "La cinta, que hoy se estrena en nuestro país, competirá contra Hors la Loi, de Argelia, Dogtooth, de Grecia, Incendies, de Canadá, Life above all , de Sudáfrica, y con la ganadora del Globo de Oro, In A Better World, de Dinamarca."
>>> blind_CAPS_without_first_word(text)

[out]:

La cinta, que hoy se estrena en nuestro país, competirá contra #NE# la #NE# , de #NE# , #NE# , de #NE# , #NE# , de #NE# á, #NE# above all , de #NE# áfrica, y con la ganadora del #NE# de #NE# , #NE# A #NE# #NE# , de #NE# .

But the regex doesn't handle accented characters when using \w, e.g. Canadá -> #NE# á; Sudáfrica -> #NE# áfrica. How do I include accented words in my regex? The result should be Canadá -> #NE#; Sudáfrica -> #NE#.

I guess it's okay to leave single-character words alone, so that A remains A, unless there's a way around this.

Upvotes: 1

Views: 4696

Answers (2)

CLaFarge

Reputation: 1365

Could you use escape notation to match ranges of characters, for example [\xC0-\xE1]? I ran it through Pythex and it seemed to accept it. You'll need to work out the right range yourself, but it's a start :)
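To make the range idea concrete, here is a minimal sketch, assuming Python 3 (where str patterns are Unicode-aware). The specific ranges and the function name blind_caps_latin1 are my own assumptions, not something from the question: they only cover the Latin-1 accented letters, which is enough for Spanish but not for every language.

```python
import re

# Assumed ranges (Latin-1 only), skipping the two non-letters in the block:
#   U+00C0-U+00D6 and U+00D8-U+00DE are uppercase (À-Ö, Ø-Þ)
#   U+00DF-U+00F6 and U+00F8-U+00FF are lowercase (ß-ö, ø-ÿ)
UPPER = r"A-Z\u00C0-\u00D6\u00D8-\u00DE"
LOWER = r"a-z\u00DF-\u00F6\u00F8-\u00FF"

# One capital followed by at least one more letter, so a lone "A" is left alone.
CAP_WORD = re.compile("[" + UPPER + "][" + UPPER + LOWER + "]+")


def blind_caps_latin1(text):
    """Replace capitalized words with #NE#, except for the first word."""
    first_word, _, the_rest = text.partition(" ")
    return first_word + " " + CAP_WORD.sub("#NE#", the_rest)


print(blind_caps_latin1("La cinta de Canadá, de Sudáfrica y Ávila."))
```

Because only the word itself is matched, the surrounding commas and spaces are left untouched.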

Hope this helps.

Upvotes: 0

Avinash Raj

Reputation: 174706

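This happens because \w+ (or [\w]+) matches only ASCII word characters when it is applied to a byte string without the re.UNICODE flag, as in Python 2, so the match stops at the first accented character.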

You may use \S+ instead of \w+

re.sub(r'[A-Z]\S+\s*', ' #NE# ', the_rest)
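One caveat with \S+ is that it matches any non-whitespace character, so punctuation attached to a word is replaced along with it. A small demonstration, assuming Python 3 and a made-up sample string:

```python
import re

# Hypothetical sample, shortened from the text in the question.
sample = "contra Hors la Loi, de Canadá, fin"

# \S+ also consumes the commas that directly follow "Loi" and "Canadá".
blinded = re.sub(r"[A-Z]\S+\s*", " #NE# ", sample)

# Collapse the doubled spaces left behind by the padded placeholder.
tidy = " ".join(blinded.split())
print(tidy)
```

The accented words are now replaced as a whole, but the commas are gone from the output, which may or may not matter for your use case.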

OR

Use the third-party regex module (pip install regex, then import regex) if you only want to match letters, in any language:

regex.sub(r'[A-Z]\p{L}+\s*', ' #NE# ', the_rest)
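If installing a third-party module is not an option, a similar effect can be sketched with the standard library alone. This is an alternative approach, not the regex-module call above, and it assumes Python 3, where \w and str.isupper() are Unicode-aware. The function and parameter names are only illustrative.

```python
import re


def blind_caps_without_first_word(text, placeholder="#NE#"):
    """Replace capitalized words (2+ letters) after the first word."""
    first_word, _, the_rest = text.partition(" ")

    def blind(match):
        word = match.group()
        # str.isupper() is Unicode-aware, so "Á" counts as a capital too.
        # Single-letter words such as "A" are left unchanged.
        if len(word) > 1 and word[0].isupper():
            return placeholder
        return word

    # [^\W\d_]+ is "word characters minus digits and underscore",
    # i.e. runs of letters, including accented ones in Python 3.
    return first_word + " " + re.sub(r"[^\W\d_]+", blind, the_rest)


print(blind_caps_without_first_word(
    "La cinta de Canadá, In A Better World, de Sudáfrica."
))
```

Deciding inside a callback also catches words that start with an accented capital, which a plain [A-Z] at the start of the pattern would miss.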

Upvotes: 7
