Capture accented characters by modifying a specific regex I have in python3

Question

I asked a question before about extracting matching groups of names & emails from a long email body text into tuples using a regex. Solution worked beautifully extracting names and emails for example from this text:

> Begin forwarded message:
> Date: December 20, 2013 at 11:32:39 AM GMT-3
> Subject: My dummy subject
> From: Charlie Brown 
> To: maria.brown@aaa.com, George Washington , =
thomas.jefferson@aaa.com, thomas.alva.edison@aaa.com, Juan =
, Alan , Alec , =
Alejandro , Alex , Andrea =
, Andrea , Andres =
, Andres 
> Hi,
> Please reply ASAP with your RSVP
> Bye

And using this Regex:

 [:,]\s*=?\s*(?:([A-Z][a-z]+(?:\s[A-Z][a-z]+)?))?\s*=?\s*.*?([\w.]+@[\w.-]+)

Producing this output:

 [(Charlie Brown', 'aaa@aaa.com'),('','maria.brown@aaa.com'),('George Washington', 'george@washington.com'),('','thomas.jefferson@aaa.com'),('','thomas.alva.edison@aaa.com'),('Juan','juan@aaa.com',('Alan', 'alan@aaa.com'), ('Alec', 'alec@aaa.com'),('Alejandro','aaa@aaa.com'),('Alex', 'aaa@aaa.com'),('Andrea','andrea.mery@thomsen.cl'),('Andrea','andrea.22@aaa.com',('Andres','andres@aaa.com'),('Andres','avaldivieso@aaa.com')]

But, I stumbled upon instances where the names on the texts I passed to the regex have special accented characters. How can I update the regex above to not break and also capture names that include accented characters like:

 "á", "é", "í", "ó", "ú", "ç", "ö", "ü", "ñ", "à", "è", "ì", "ò", "ù"

(and their UPPER counterparts)

Thanks!

Avinash Raj · Accepted Answer

Use regex module instead of re to support unicode regex. I just changed [a-z]+ to \p{L}+ (which matches any kind of letter from any language) in the pattern which capture the names.

>>> s = """> Begin forwarded message:
> Date: December 20, 2013 at 11:32:39 AM GMT-3
> Subject: My dummy subject
> From: Chrálié Brown 
> To: maria.brown@aaa.com, George Washington , =
thomas.jefferson@aaa.com, thomas.alva.edison@aaa.com, Juan =
, Alan , Alec , =
Alejandro , Alex , Andrea =
, Andrea , Andres =
, Andres 
> Hi,
> Please reply ASAP with your RSVP
> Bye"""
>>> import regex
>>> regex.findall(r'[:,]\s*=?\s*(?:([A-Z]\p{L}+(?:\s[A-Z]\p{L}+)?))?\s*=?\s*.*?([\w.]+@[\w.-]+)', s)
[('Chrálié Brown', 'aaa@aa-aaa.com'), ('', 'maria.brown@aaa.com'), ('George Washington', 'george@washington.com'), ('', 'thomas.jefferson@aaa.com'), ('', 'thomas.alva.edison@aaa.com'), ('Juan', 'juan@aaa.com'), ('Alan', 'alan@aaa.com'), ('Alec', 'alec@aaa.com'), ('Alejandro', 'aaa@aaa.com'), ('Alex', 'aaa@planeas.com'), ('Andrea', 'andrea.mery@thomsen.cl'), ('Andrea', 'andrea.22@aaa.com'), ('Andres', 'andres@aaa.com'), ('Andres', 'avaldivieso@aaa.com')]

Capture accented characters by modifying a specific regex I have in python3

Answers (1)

Related Questions