Reputation: 345
I will be gathering scattered emails from a larger CSV file. I am just now learning regex. I am trying to extract the emails from this example sentence. However, emails is populating with only the @ symbol and the letter immediately before that. Can you help me see what's going wrong?
import re
String = "'Jessica's email is [email protected], and Daniel's email is [email protected]. Edward's is [email protected], and his grandfather, Oscar's, is [email protected].'"
emails = re.findall(r'.[@]', String)
names = re.findall(r'[A-Z][a-z]*',String)
print(emails)
print(names)
Upvotes: 0
Views: 3949
Reputation: 403218
1. Emails
In [1382]: re.findall(r'\S+@\w+\.\w+', text)
Out[1382]:
['[email protected]',
'[email protected]',
'[email protected]',
'[email protected]']
How it works: All emails have the form [email protected]. The things to note are the run of characters on either side of `@` and the single `.` in the domain. So we use `\S` to match any character that is not whitespace, and `+` to match one or more such characters. `\w+\.\w+` matches word characters, one literal dot, then more word characters, i.e. a domain with exactly one `.` in it.
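As a minimal sketch of the pattern above, here it is run on a made-up sentence. The variable `text` and the addresses in it are hypothetical stand-ins, not the data from the question.

```python
import re

# Hypothetical sample text; the addresses are invented for illustration.
text = "Ann's email is ann99@example.com, and Bob's is bob_smith@mail.org."

# \S+       one or more non-whitespace characters (the part before @)
# @         a literal at sign
# \w+\.\w+  word characters, one literal dot, then word characters
emails = re.findall(r'\S+@\w+\.\w+', text)
print(emails)  # ['ann99@example.com', 'bob_smith@mail.org']
```

Because the pattern ends in `\w+`, the comma and the sentence-final period after each address are left out of the match.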
2. Names
In [1375]: re.findall(r'[A-Z][\S]+(?=\')', text)
Out[1375]: ['Jessica', 'Daniel', 'Edward', 'Oscar']
How it works: Match any word starting with an uppercase letter. The `(?=\')` is a lookahead. As you can see, all names follow the pattern `Name's`. We want everything before the apostrophe, hence the lookahead, which is checked but not captured.
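The lookahead can be tried on its own with a small sketch. The sample `text` below is hypothetical, and the pattern is written with double quotes so the apostrophe needs no escaping.

```python
import re

# Hypothetical sample text; names and addresses are invented for illustration.
text = "Ann's email is ann99@example.com, and Bob's is bob_smith@mail.org."

# [A-Z]    an uppercase letter
# [\S]+    one or more non-whitespace characters
# (?=')    succeed only if an apostrophe comes next, without consuming it
names = re.findall(r"[A-Z][\S]+(?=')", text)
print(names)  # ['Ann', 'Bob']
```

The apostrophe itself never appears in the results, because a lookahead only asserts what follows the match.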
Now, if you want to map names to emails by capturing them together with one massive regex, you can. Jean-François Fabre's answer is a good start. But I recommend getting the basics down pat first.
Upvotes: 2
Reputation: 140307
Your e-mail regex does not work at all: emails = re.findall(r'.[@]', String) matches any single character followed by `@`, and nothing more.
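To see this concretely, the sketch below runs the original pattern on a hypothetical one-address sentence and compares it with a version that uses a quantifier. The sample `text` is an assumption made for illustration.

```python
import re

# Hypothetical sample text with a single invented address.
text = "Ann's email is ann99@example.com."

# '.' matches exactly one character, so only the character before '@' is kept.
one_char = re.findall(r'.[@]', text)
print(one_char)  # ['9@']

# Adding '+' lets the part before '@' grow to one or more non-whitespace characters.
with_plus = re.findall(r'\S+@', text)
print(with_plus)  # ['ann99@']
```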
I would try a different approach: match the sentences and extract (name, e-mail) pairs, with the following empirical assumptions (if your text changes too much, the logic will break):
- every name is followed by `'s`, with ` is ` somewhere after it (a non-greedy `.*?` matches everything in between)
- `\w` matches any alphanumeric character (or underscore), and the domain contains only one dot (otherwise the pattern would also match the final dot of the sentence)
Code:
import re
String = "'Jessica's email is [email protected], and Daniel's email is [email protected]. Edward's is [email protected], and his grandfather, Oscar's, is [email protected].'"
print(re.findall(r"(\w+)'s.*? is (\w+@\w+\.\w+)",String))
result:
[('Jessica', '[email protected]'), ('Daniel', '[email protected]'), ('Edward', '[email protected]'), ('Oscar', '[email protected]')]
Converting the result to `dict` would even give you a dictionary mapping name => address:
{'Oscar': '[email protected]', 'Jessica': '[email protected]', 'Daniel': '[email protected]', 'Edward': '[email protected]'}
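A minimal sketch of that conversion, assuming a hypothetical sample sentence: `re.findall` with two groups returns a list of 2-tuples, which `dict` accepts directly.

```python
import re

# Hypothetical sample text; names and addresses are invented for illustration.
text = "Ann's email is ann99@example.com, and Bob's is bob_smith@mail.org."

# Group 1 captures the name before 's; group 2 captures the address after " is ".
pairs = re.findall(r"(\w+)'s.*? is (\w+@\w+\.\w+)", text)

# Each (name, address) tuple becomes one key/value entry.
name_to_email = dict(pairs)
print(name_to_email)  # {'Ann': 'ann99@example.com', 'Bob': 'bob_smith@mail.org'}
```

If the same name appears twice, later pairs overwrite earlier ones, so this only suits data where each name is unique.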
The general case needs more chars (not sure I'm exhaustive):
String = "'Jessica's email is [email protected], and Daniel's email is [email protected]. Edward's is [email protected], and his grandfather, Oscar's, is [email protected].'"
print(re.findall(r"(\w+)'s.*? is ([\w\-.]+@[\w\-.]+\.[\w\-]+)",String))
result:
[('Jessica', '[email protected]'), ('Daniel', '[email protected]'), ('Edward', '[email protected]'), ('Oscar', '[email protected]')]
Upvotes: 5
Reputation: 129
You need to find anchors, patterns to match. An improved pattern could be:
import re
String = "'Jessica's email is [email protected], and Daniel's email is [email protected]. Edward's is [email protected], and his grandfather, Oscar's, is [email protected].'"
emails = re.findall(r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+', String)
names = re.findall(r'[A-Z][a-z]*', String)
print(emails)
print(names)
`\w+` does not match '-', which is allowed in email addresses.
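The difference shows up on a hyphenated address. The sketch below uses a hypothetical sentence to compare a `\w+`-based pattern with the character-class pattern from this answer.

```python
import re

# Hypothetical sample text with an invented, hyphenated address.
text = "Write to mary-jane@example.org today."

# \w+ stops at the hyphen, so the local part is truncated.
narrow = re.findall(r'\w+@\w+\.\w+', text)
print(narrow)  # ['jane@example.org']

# The explicit character classes accept '-', '.', '+' and '_' before the @.
broad = re.findall(r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+', text)
print(broad)  # ['mary-jane@example.org']
```

One trade-off to keep in mind: the last character class also allows `.`, so an address that ends a sentence would be matched together with the trailing period.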
Upvotes: 1
Reputation: 2540
This is because you are not using a repetition operator. The code below uses the + operator, which means the character or sub-pattern just before it can repeat one or more times.
import re

s = '''Jessica's email is [email protected], and Daniel's email is [email protected]. Edward's is [email protected], and his grandfather, Oscar's, is [email protected].'''
p = r'[a-z0-9]+@[a-z]+\.[a-z]+'
ans = re.findall(p, s)
print(ans)
Upvotes: 0