Reputation: 1099
I want to know whether I can use regular expressions to extract email addresses from the strings below.
I tried the pattern .*@.*
to match all of the strings. It worked for some of them, but not for all.
I want to match every email address, whatever the domain, for example (some-url.com) or (some-url.co.id).
boleh di kirim ke email saya [email protected] tks...
boleh minta kirim ke [email protected].
[email protected]. .
[email protected] Senior Quantity Surveyor
[email protected], terimakasih bu Cindy Hartanto
[email protected] saya mau dong bu cindy
[email protected]
Hi Cindy ...pls share the Salary guide to [email protected] thank a
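For readers wondering why .*@.* does not isolate the address: both .* parts are greedy and accept spaces, so the match grows to cover the whole line around the @. A minimal sketch in JavaScript, using a made-up line and address rather than the data above:

```javascript
// Greedy .* on both sides of the @ swallows everything on the line.
const GREEDY = /.*@.*/;

// Hypothetical sample line; the address is illustrative only.
const line = "please send it to user@example.com thanks";

// The match is the entire line, not just the address.
console.log(line.match(GREEDY)[0]); // please send it to user@example.com thanks
```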
Upvotes: 46
Views: 163270
Reputation: 1032
All answers here are fundamentally incorrect. Why are you trying to validate an email address? On the web, you should validate with a user-response anyway. The absolute worst thing you can do is to block a valid address. It's usually fine to permit an address that isn't matching perfectly, because that will simply be bounced down the line. However, you also don't want to allow things that can break your system: non-ASCII characters, and entries exceeding the length-limits.
According to RFC 5322, the "local-part" (before @) may consist of any printable ASCII character, except "special characters"; or it may consist of a double-quoted string of any printable ASCII characters (excluding only the double-quote character unless preceded by \). Even this is a simplification, because it doesn't account for the "Comments-with-folding-space" headache of the RFC. Ignoring that last part, you can create a compliant regex simply with:
^(\S+?)@(\S+)$
That might be a little too permissive, but it won't reject valid RFC mail addresses, as many twits are wont to do. If the address is invalid, the mail will simply bounce. And, it's simple.
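A minimal sketch of how this permissive pattern behaves in JavaScript, assuming the input is a single candidate address with no surrounding text. The function name looksLikeEmail and the sample addresses are illustrative and not part of the answer:

```javascript
// Permissive check: some non-blank text, an @, some more non-blank text.
// The ^ and $ anchors mean this validates one token; it does not extract from free text.
const PERMISSIVE = /^(\S+?)@(\S+)$/;

function looksLikeEmail(candidate) {
  const m = PERMISSIVE.exec(candidate);
  // Return the two captures (local part, domain part), or null when nothing matches.
  return m ? { local: m[1], domain: m[2] } : null;
}

console.log(looksLikeEmail("user@example.com"));         // { local: 'user', domain: 'example.com' }
console.log(looksLikeEmail("no at sign here"));          // null
console.log(looksLikeEmail("send to user@example.com")); // null, because spaces are not \S
```

Because of the anchors, this is a validator for an isolated value; pulling addresses out of a sentence needs an unanchored pattern.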
If you want to be a bit more restrictive without becoming non-conformant, you can use the [:graph:] character class with length-limits:
^([[:graph:]]{1,64})@([[:graph:]]{1,255})$
Now you have something that is compliant, reasonably restrictive, and prevents buffer-overflows and nefarious activities. Just make sure your regex engine is not using UTF-8.
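JavaScript has no POSIX [[:graph:]] class, so a port has to spell the range out. The sketch below assumes [\x21-\x7e] (printable ASCII without the space) as the equivalent; that substitution, the function name checkWithLimits, and the sample addresses are my own and not from the answer:

```javascript
// Assumption: [\x21-\x7e] stands in for POSIX [[:graph:]] restricted to ASCII.
// Local part is limited to 64 characters and the domain part to 255.
const GRAPH_LIMITED = /^([\x21-\x7e]{1,64})@([\x21-\x7e]{1,255})$/;

function checkWithLimits(candidate) {
  return GRAPH_LIMITED.test(candidate);
}

console.log(checkWithLimits("user@example.com"));              // true
console.log(checkWithLimits("a".repeat(65) + "@example.com")); // false, local part is over 64 characters
console.log(checkWithLimits("us\u00e9r@example.com"));         // false, contains a non-ASCII character
```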
Now, according to the RFCs, the following characters are not allowed in some circumstances, defined as "special" characters:
< > ( ) [ ] : ; @ \ , . "
Further, the special characters are allowed in the local-part when quoted, except for the quote character itself, and in the domain part when bracketed. So for full compliance with more opportunities to filter out bad ones, we can still do:
^(?:("(?:[^\\"]|\\.){1,62}")|([[:graph:]]{1,64}))@(?:(\[[^]\\[]{1,253}\])|([[:graph:]]{1,255}))$
We can probably do better than this, but it becomes ugly (see the solution way below), and usually at the price of compliance.
After matching the above, we need to further check our captures:
The first ($1) is the quoted-text of the local-part (if it exists); the second ($2) is the dot-atom-text of the local-part (if no quoted text provided); the third ($3) is the domain-literal (if bracketed) and the fourth ($4) is the dot-atom-text of the domain-part (if not bracketed). Breaking it down this way allows us to more easily check the parts, because there are different rules for each.
[...] [:graph:] character class, but that was too tricky to implement correctly here. (Advanced double-extended Perl regexes can combine character classes, but that's unlikely to be supported by other languages.)
[...] [:graph:] character class. The ugliness here is to ensure it does not contain a backslash or embedded bracket character.
[...] . (period) -- see below. Check separately for that. However, see 10.
[...] ., neither may begin nor end with it, and there may not be two consecutive ones. It must be checked separately (or the length is checked separately, one or the other).
[...] _ in it, but it's perfectly fine as a local-host definition in both Unix and Microsoft, and by extension, Docker.
So now you need to check the parts correctly:
$quoted_text = $1;
$local_dot_atom = $2;
$domain_literal = $3;
$domain_dot_atom = $4;
return 0 if ($quoted_text . $domain_literal) =~ /[^[:graph:]]/;
return 0 if ($domain_dot_atom . $local_dot_atom) =~ /[]<>():;@\\,"[]/;
# ^- note, the period is omitted from special class, and checked below.
return 0 if $local_dot_atom =~ /(^\.)|(\.\.)|(\.$)/;
return 0 if $domain_dot_atom =~ /(^\.)|(\.\.)|(\.$)/;
return ( $quoted_text . $local_dot_atom, $domain_literal, $domain_dot_atom );
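For anyone not working in Perl, here is a rough JavaScript port of the same two-stage idea: match once, then check each capture under its own rules. It is a sketch under assumptions, not the answer's code: [\x21-\x7e] stands in for [[:graph:]], the bracket class is escaped for JavaScript syntax, the first check is taken to cover the quoted text plus the domain literal, and the name splitAddress and the return shape are made up.

```javascript
// Four captures: quoted local part, dot-atom local part, bracketed domain literal, dot-atom domain.
const PARTS = /^(?:("(?:[^\\"]|\\.){1,62}")|([\x21-\x7e]{1,64}))@(?:(\[[^\]\\\[]{1,253}\])|([\x21-\x7e]{1,255}))$/;
const NON_GRAPH = /[^\x21-\x7e]/;     // anything outside printable, non-space ASCII
const SPECIALS = /[\]<>():;@\\,"\[]/; // the period is deliberately left out and checked below
const BAD_DOTS = /^\.|\.\.|\.$/;      // leading, doubled, or trailing period

function splitAddress(candidate) {
  const m = PARTS.exec(candidate);
  if (!m) return null;
  // Groups that did not take part in the match are undefined; treat them as empty strings.
  const [quoted, localAtom, literal, domainAtom] = m.slice(1).map((s) => s || "");
  if (NON_GRAPH.test(quoted + literal)) return null;
  if (SPECIALS.test(domainAtom + localAtom)) return null;
  if (BAD_DOTS.test(localAtom) || BAD_DOTS.test(domainAtom)) return null;
  return { local: quoted + localAtom, domainLiteral: literal, domain: domainAtom };
}

console.log(splitAddress("user@example.com"));  // { local: 'user', domainLiteral: '', domain: 'example.com' }
console.log(splitAddress("a..b@example.com"));  // null, two consecutive periods
console.log(splitAddress("user@[192.0.2.1]"));  // { local: 'user', domainLiteral: '[192.0.2.1]', domain: '' }
```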
Now, if you really want a single regex that does nearly all the above, Xavier Spriet wrote a fantastic article on this topic and provided a one-step regex that checks for most of this. He explains it in great detail in that article. In case it goes away, here's the final version:
(?:[a-zA-Z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-zA-Z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:[a-zA-Z0-9](?:[a-zA-Z0-9-]*[a-zA-Z0-9])?\.)+[a-zA-Z0-9](?:[a-zA-Z0-9-]*[a-zA-Z0-9])?
The above is great, but I wouldn't use it:
Upvotes: 2
Reputation: 3968
You can create a function with the regex /([a-zA-Z0-9._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9_-]+)/ to extract email IDs from long text:
function extractEmails (text) {
return text.match(/([a-zA-Z0-9._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9_-]+)/gi);
}
Script in action: Run to see result
var text = `boleh di kirim ke email saya [email protected] tks... boleh minta kirim ke [email protected]. [email protected]. .
[email protected] Senior Quantity Surveyor
[email protected], terimakasih bu Cindy Hartanto
[email protected] saya mau dong bu cindy
[email protected]
Hi Cindy ...pls share the Salary guide to [email protected] thank a`;
function extractEmails ( text ){
return text.match(/([a-zA-Z0-9._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9_-]+)/gi);
}
$("#emails").text(extractEmails(text));
<script src="https://ajax.googleapis.com/ajax/libs/jquery/1.9.1/jquery.min.js"></script>
<p id="emails"></p>
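One thing to watch for with the function above: match with the g flag returns null, not an empty array, when nothing is found. A minimal sketch of a null-safe variant, assuming plain Node or browser JavaScript without jQuery; the name extractEmailsSafe and the sample addresses are made up for illustration:

```javascript
// Same character classes as the regex above, without the capture group (not needed with /g).
const EMAIL = /[a-zA-Z0-9._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9_-]+/g;

function extractEmailsSafe(text) {
  // Fall back to an empty array so callers can always iterate over the result.
  return text.match(EMAIL) || [];
}

const sample = "write to alice@example.com. or bob@example.co.id, thanks";
console.log(extractEmailsSafe(sample));            // [ 'alice@example.com', 'bob@example.co.id' ]
console.log(extractEmailsSafe("no address here")); // []
```

Note that the trailing period after the first address is not captured, because the final part of the pattern does not accept a dot.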
-----Update-----
The regex in the snippet above matches most email patterns. If you need to match more than 99% of email patterns, including edge cases (like '+' in the address), use the regex pattern shown below.
Script in action: Run to see result
var text = `boleh di kirim ke email saya [email protected] tks... boleh minta kirim ke [email protected]. [email protected]. .
[email protected] Senior Quantity Surveyor
[email protected], terimakasih bu Cindy Hartanto
[email protected] saya mau dong bu cindy
[email protected]
Hi Cindy ...pls share the Salary guide to [email protected] thank a`;
function extractEmails ( text ){
return text.match(/(?:[a-z0-9+!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])/gi);
}
$("#emails").text(extractEmails(text));
<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/2.2.2/jquery.min.js"></script>
<p id="emails"></p>
Upvotes: 109
Reputation: 56
This works very well for me in Python. Try it yourself.
[a-z]+@[a-z]+\.[a-z]+
Upvotes: 0
Reputation: 984
I would like to add to @Ambrish Pathak's answer.
According to Wikipedia, an email address can also contain a + sign, so
([a-zA-Z0-9+._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9_-]+)
works like a charm.
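A quick comparison of the two patterns, sketched in JavaScript with a made-up address, shows what the extra + changes:

```javascript
// Pattern from the earlier answer (no + in the local part) versus the one with + added.
const WITHOUT_PLUS = /[a-zA-Z0-9._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9_-]+/g;
const WITH_PLUS = /[a-zA-Z0-9+._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9_-]+/g;

// Hypothetical input with a plus-tagged address.
const text = "reach me at jane+news@example.com please";

console.log(text.match(WITHOUT_PLUS)); // [ 'news@example.com' ], the part before + is lost
console.log(text.match(WITH_PLUS));    // [ 'jane+news@example.com' ]
```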
Upvotes: 37
Reputation: 4523
You can use the following regex to capture all the email addresses.
(?<name>[\w.]+)\@(?<domain>\w+\.\w+)(\.\w+)?
Additionally, if you want to capture only emails from a specific domain (e.g. some-url.com), replace the \w+\.\w+ part after <domain> with your desired domain name. It would then look like (?<name>[\w.]+)\@(?<domain>outlook\.com)(\.\w+)?
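A small sketch of reading the named groups in JavaScript, assuming a runtime with matchAll (Node 12 or later). The function name extractParts and the sample addresses are illustrative. Note that the domain group holds only the first two labels; a third label such as .id lands in the unnamed optional group and is visible only in the full match.

```javascript
// Same pattern as above with the g flag added so matchAll can iterate over every hit.
const NAMED = /(?<name>[\w.]+)@(?<domain>\w+\.\w+)(\.\w+)?/g;

function extractParts(text) {
  // Build one plain object per match from the full match and the named groups.
  return Array.from(text.matchAll(NAMED), (m) => ({
    address: m[0],
    name: m.groups.name,
    domain: m.groups.domain,
  }));
}

console.log(extractParts("mail bob@example.co.id or amy.lee@example.com."));
// [
//   { address: 'bob@example.co.id', name: 'bob', domain: 'example.co' },
//   { address: 'amy.lee@example.com', name: 'amy.lee', domain: 'example.com' }
// ]
```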
Upvotes: 2
Reputation: 13704
[a-zA-Z0-9-_.]+@[a-zA-Z0-9-_.]+
worked for me; you can check the result on this regex101 saved regex.
It's really just the same pattern twice, separated by an @ sign.
The pattern is 1 or more occurrences of:
a-z: any lowercase letter
A-Z: any uppercase letter
0-9: any digit
-_.: a hyphen, an underscore or a dot
If it misses some emails, add any missing character to it and that should do the trick.
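Since the dot is part of the allowed set on both sides, this pattern also picks up a period that ends a sentence. A minimal JavaScript sketch of that effect, with a made-up sentence; the trailing-character cleanup step is my own assumption, not part of the answer:

```javascript
// The pattern from this answer, with the g flag to collect every match.
const LOOSE = /[a-zA-Z0-9-_.]+@[a-zA-Z0-9-_.]+/g;

// Hypothetical input where the address is followed by a sentence-ending period.
const text = "send it to user@example.com. thanks";

const raw = text.match(LOOSE) || [];
console.log(raw); // [ 'user@example.com.' ], the sentence period is included

// Assumed cleanup: drop any dots or hyphens left at the end of each match.
const trimmed = raw.map((s) => s.replace(/[.-]+$/, ""));
console.log(trimmed); // [ 'user@example.com' ]
```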
Edit
I didn't notice it at first, but the regex101 link has an Explanation section at the top-right corner of the screen that explains what the regular expression matches.
Upvotes: 8