Reputation: 1054

How to extract all names from a block of text

I'm trying to scrape names from a chunk of text (from an email body actually) that normally looks similar to this:

From: [email protected]
CC: John Smith <[email protected]>, Charles <[email protected]>, Mary Lamb <[email protected]>, Chino <[email protected]>, Claudia <[email protected]>, <[email protected]>, <[email protected]>, John <[email protected]>
Hi there AAA! Hope you had a wonderful time
Best,
AAA

I would like to end up with a list that holds only the names (first and last, if available) of everyone on the CC line, discarding the rest of the information. What would be a simple and clean approach using regex? (This is not a test; it's a real app I'm working on, and I'm stuck.) I was already able to extract all the email addresses using re.findall() with an email-matching pattern I found.

Thanks

Upvotes: 0

Answers (4)

depsai

Reputation: 415

This will capture strictly what you need.

[:,]\s((?:(?![:,<]).)*)\s\<

Use group 1 to get the text.
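
As a minimal sketch of how that might look in Python: when a pattern has exactly one capturing group, re.findall() returns the group 1 text of each match. The sample text and its example.com addresses are hypothetical placeholders, not the data from the question.

```python
import re

# Hypothetical sample; the addresses are placeholders, not from the question.
text = (
    "From: sender@example.com\n"
    "CC: John Smith <john@example.com>, Charles <charles@example.com>, "
    "<nobody@example.com>, Mary Lamb <mary@example.com>\n"
    "Hi there AAA! Hope you had a wonderful time\n"
    "Best,\n"
    "AAA"
)

# The pattern from this answer: a ':' or ',', one whitespace character,
# then any run of characters that are not ':', ',' or '<' (group 1),
# then whitespace and '<'.
pattern = re.compile(r"[:,]\s((?:(?![:,<]).)*)\s<")

# With a single capturing group, findall returns a list of group 1 strings.
names = pattern.findall(text)
print(names)  # ['John Smith', 'Charles', 'Mary Lamb']
```

The bare `<nobody@example.com>` entry is skipped because nothing sits between the comma's trailing space and the `<`.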

Upvotes: 0

Avinash Raj

Reputation: 174706

You could try the below.

>>> import re
>>> s = """From: [email protected]
... CC: John Smith <[email protected]>, Charles <[email protected]>, Mary Lamb <[email protected]>, Chino <[email protected]>, Claudia <[email protected]>, <[email protected]>, <[email protected]>, John <[email protected]>
... Hi there AAA! Hope you had a wonderful time
... Best,
... AAA"""
>>> re.findall(r'(?<=[:,]\s)[A-Z][a-z]+(?:\s[A-Z][a-z]+)?(?=\s<)', s)
['John Smith', 'Charles', 'Mary Lamb', 'Chino', 'Claudia', 'John']
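
One thing worth checking before relying on this pattern: it assumes each name part is a single capital letter followed by lowercase letters, and allows at most two parts. A quick sketch with hypothetical names (not from the question) shows what that assumption skips:

```python
import re

# Same pattern as above.
pattern = r"(?<=[:,]\s)[A-Z][a-z]+(?:\s[A-Z][a-z]+)?(?=\s<)"

# Hypothetical CC line; "McDonald" has a capital letter mid-word.
cc = "CC: Ann McDonald <ann@example.com>, Bob <bob@example.com>"

found = re.findall(pattern, cc)
print(found)  # ['Bob']
```

If names like these can appear in the real data, a looser character class such as the `[\w ]+` used in other answers may be a better fit.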

Upvotes: 1

Andrew Luo

Reputation: 927

Use the regex:

re.findall(r"(?:CC: |, )([\w ]*) <\S*@\S*>", text)
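
For illustration, a self-contained sketch of that call. The pattern is written as a raw string so that `\w` and `\S` reach the regex engine unchanged, and the input variable is named `text` here to avoid shadowing the built-in `str`. The CC line and its example.com addresses are hypothetical.

```python
import re

# Hypothetical CC line; the addresses are placeholders.
text = (
    "CC: John Smith <john@example.com>, Charles <charles@example.com>, "
    "<nobody@example.com>, John <john2@example.com>"
)

# Each entry must start with "CC: " or ", ", followed by the name (group 1),
# a space, and an angle-bracketed token containing '@'.
names = re.findall(r"(?:CC: |, )([\w ]*) <\S*@\S*>", text)
print(names)  # ['John Smith', 'Charles', 'John']
```

Because the pattern requires a literal space before `<`, the entry with no name in front of the address does not match at all, so no empty string ends up in the list.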

Upvotes: 0

Amal Murali

Reputation: 76656

You can use this regex:

[:,] ([\w ]+) \<


>>> import re
>>> p = re.compile(r'[:,] ([\w ]+) \<')
>>> m = p.findall(text)
>>> print(m)
['John Smith', 'Charles', 'Mary Lamb', 'Chino', 'Claudia', 'John']

Upvotes: 3
