Reputation: 1054
I'm trying to scrape names from a chunk of text (from an email body actually) that normally looks similar to this:
From: [email protected]
CC: John Smith <[email protected]>, Charles <[email protected]>, Mary Lamb <[email protected]>, Chino <[email protected]>, Claudia <[email protected]>, <[email protected]>, <[email protected]>, John <[email protected]>
Hi there AAA! Hope you had a wonderful time
Best,
AAA
I would like to end up with a list that holds only the names (first and last, if available) of everyone on the CC line, discarding the rest of the information. What would be a simple and clean approach using regex? (This is not a test; it's a real app I'm working on, and I'm stuck.) I was already able to extract all the email addresses using re.findall() with an email-matching pattern I found.
Thanks
Upvotes: 0
Views: 190
Reputation: 415
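This pattern captures only the display name that sits between a colon or comma and the following angle bracket, so entries with no name are skipped:
[:,]\s((?:(?![:,<]).)*)\s\<
Use group 1 to get the name.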
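As a rough sketch of how group 1 could be collected in Python: since the pattern has a single capturing group, re.findall() returns the group 1 text for each match. The sample text and its example.com addresses are placeholders invented for illustration, not the asker's real data.

```python
import re

# Hypothetical sample; the example.com addresses are placeholders.
text = (
    "From: sender@example.com\n"
    "CC: John Smith <john@example.com>, Charles <charles@example.com>, "
    "Mary Lamb <mary@example.com>, <anon@example.com>, John <j2@example.com>\n"
    "Hi there AAA! Hope you had a wonderful time\n"
)

# Same pattern as above. With exactly one capturing group,
# findall() returns a list of the group 1 strings.
pattern = re.compile(r'[:,]\s((?:(?![:,<]).)*)\s<')
names = pattern.findall(text)
print(names)  # ['John Smith', 'Charles', 'Mary Lamb', 'John']
```

The nameless entry is skipped because nothing sits between the comma and the angle bracket.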
Upvotes: 0
Reputation: 174706
You could try the following:
>>> import re
>>> s = """From: [email protected]
... CC: John Smith <[email protected]>, Charles <[email protected]>, Mary Lamb <[email protected]>, Chino <[email protected]>, Claudia <[email protected]>, <[email protected]>, <[email protected]>, John <[email protected]>
... Hi there AAA! Hope you had a wonderful time
... Best,
... AAA"""
>>> re.findall(r'(?<=[:,]\s)[A-Z][a-z]+(?:\s[A-Z][a-z]+)?(?=\s<)', s)
['John Smith', 'Charles', 'Mary Lamb', 'Chino', 'Claudia', 'John']
Upvotes: 1
Reputation: 76656
You can use this regex:
[:,] ([\w ]+) \<
>>> import re
>>> p = re.compile(r'[:,] ([\w ]+) \<')
>>> m = p.findall(text)
>>> print(m)
['John Smith', 'Charles', 'Mary Lamb', 'Chino', 'Claudia', 'John']
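Because the pattern keys on any colon or comma, it may also pick up text from the message body if the body happens to contain something like "Note: see Bob <...>". One possible way around that, sketched below, is to isolate the CC line first and run the same pattern on that line only. The helper name extract_cc_names and the sample text are hypothetical, and the sketch assumes the CC header fits on a single line starting with "CC:".

```python
import re

def extract_cc_names(text):
    """Return display names found on the first line starting with 'CC:'.

    Hypothetical helper; assumes the CC header is a single line.
    """
    # With re.MULTILINE, ^ and $ anchor at line boundaries.
    cc_line = re.search(r'^CC:.*$', text, flags=re.MULTILINE)
    if cc_line is None:
        return []
    # Apply the name pattern to the CC line only.
    return re.findall(r'[:,] ([\w ]+) <', cc_line.group(0))

# Hypothetical sample; the example.com addresses are placeholders.
text = (
    "From: sender@example.com\n"
    "CC: John Smith <john@example.com>, <anon@example.com>, "
    "Mary Lamb <mary@example.com>\n"
    "Note: see Bob <bob@example.com> for details\n"
)

print(extract_cc_names(text))                 # ['John Smith', 'Mary Lamb']
print(re.findall(r'[:,] ([\w ]+) <', text))   # ['John Smith', 'Mary Lamb', 'see Bob']
```

The second call shows the whole-text search picking up "see Bob" from the body, which the CC-only version avoids.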
Upvotes: 3