Reputation: 1054
I previously asked a question about extracting matching groups of names and emails from a long email body into tuples using a regex. The solution worked beautifully, extracting names and emails from text such as this:
> Begin forwarded message:
> Date: December 20, 2013 at 11:32:39 AM GMT-3
> Subject: My dummy subject
> From: Charlie Brown <[email protected]>
> To: [email protected], George Washington <[email protected]>, =
[email protected], [email protected], Juan =
<[email protected]>, Alan <[email protected]>, Alec <[email protected]>, =
Alejandro <[email protected]>, Alex <[email protected]>, Andrea =
<[email protected]>, Andrea <[email protected]>, Andres =
<[email protected]>, Andres <[email protected]>
> Hi,
> Please reply ASAP with your RSVP
> Bye
And using this Regex:
[:,]\s*=?\s*(?:([A-Z][a-z]+(?:\s[A-Z][a-z]+)?))?\s*=?\s*.*?([\w.]+@[\w.-]+)
Producing this output:
[('Charlie Brown', '[email protected]'), ('', '[email protected]'), ('George Washington', '[email protected]'), ('', '[email protected]'), ('', '[email protected]'), ('Juan', '[email protected]'), ('Alan', '[email protected]'), ('Alec', '[email protected]'), ('Alejandro', '[email protected]'), ('Alex', '[email protected]'), ('Andrea', '[email protected]'), ('Andrea', '[email protected]'), ('Andres', '[email protected]'), ('Andres', '[email protected]')]
However, I ran into cases where the names in the text I pass to the regex contain accented characters. How can I update the regex above so that it does not break and also captures names that include accented characters such as:
"á", "é", "í", "ó", "ú", "ç", "ö", "ü", "ñ", "à", "è", "ì", "ò", "ù"
(and their UPPER counterparts)
Thanks!
Upvotes: 1
Views: 126
Reputation: 174706
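Use the regex module instead of re to get Unicode property support. I just changed [a-z]+ to \p{L}+ (which matches any kind of letter from any language) in the part of the pattern that captures the names. The leading [A-Z] is unchanged, so a name must still start with an ASCII capital; swap it for \p{Lu} if accented capitals such as "Á" should match too.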
>>> s = """> Begin forwarded message:
> Date: December 20, 2013 at 11:32:39 AM GMT-3
> Subject: My dummy subject
> From: Chrálié Brown <[email protected]>
> To: [email protected], George Washington <[email protected]>, =
[email protected], [email protected], Juan =
<[email protected]>, Alan <[email protected]>, Alec <[email protected]>, =
Alejandro <[email protected]>, Alex <[email protected]>, Andrea =
<[email protected]>, Andrea <[email protected]>, Andres =
<[email protected]>, Andres <[email protected]>
> Hi,
> Please reply ASAP with your RSVP
> Bye"""
>>> import regex
>>> regex.findall(r'[:,]\s*=?\s*(?:([A-Z]\p{L}+(?:\s[A-Z]\p{L}+)?))?\s*=?\s*.*?([\w.]+@[\w.-]+)', s)
[('Chrálié Brown', '[email protected]'), ('', '[email protected]'), ('George Washington', '[email protected]'), ('', '[email protected]'), ('', '[email protected]'), ('Juan', '[email protected]'), ('Alan', '[email protected]'), ('Alec', '[email protected]'), ('Alejandro', '[email protected]'), ('Alex', '[email protected]'), ('Andrea', '[email protected]'), ('Andrea', '[email protected]'), ('Andres', '[email protected]'), ('Andres', '[email protected]')]
Upvotes: 1