Cignitor

Reputation: 1099

regex extract email from strings

Can I use regular expressions to extract the email addresses from the following strings?

I am currently using the RE pattern .*@.* to match against all of the strings. It has worked fine for some of the strings, but not for all of them.

I want to match everything that fits an email pattern, including any domain such as (some-url.com) or (some-url.co.id).

boleh di kirim ke email saya [email protected] tks...
boleh minta kirim ke [email protected]. 
[email protected]. .
[email protected] Senior Quantity Surveyor
[email protected], terimakasih bu Cindy Hartanto
[email protected] saya mau dong bu cindy
[email protected] 
Hi Cindy ...pls share the Salary guide to [email protected] thank a

Upvotes: 46

Views: 163270

Answers (8)

Otheus

Reputation: 1032

All answers here are fundamentally incorrect. Why are you trying to validate an email address? On the web, you should validate with a user-response anyway. The absolute worst thing you can do is to block a valid address. It's usually fine to permit an address that isn't matching perfectly, because that will simply be bounced down the line. However, you also don't want to allow things that can break your system: non-ASCII characters, and entries exceeding the length-limits.

According to RFC5322, the "local-part" (before @) may consist of any printable ASCII character, except "special characters"; or it may consist of a double-quoted string of any printable ASCII characters (excluding only the double-quote character unless preceded by \). Even this is a simplification, because it doesn't account for the "Comments-with-folding-space" headache of the RFC. Ignoring that last part, you can create a compliant regex simply with:

^(\S+?)@(\S+)$

That might be a little too permissive, but it won't restrict valid RFC mail addresses, as many twits are wont to do. If it's invalid, it will simply bounce. And, it's simple.

If you want to be a bit more restrictive without becoming non-conformant, you can use the [:graph:] character class with length-limits:

^([[:graph:]]{1,64})@([[:graph:]]{1,255})$

Now you have something that is compliant, reasonably restrictive, and prevents buffer-overflows and nefarious activities. Just make sure your regex engine is not using UTF-8.
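JavaScript's regex engine has no POSIX [:graph:] class, so here is a rough sketch of the same length-limited pattern with the visible-ASCII range \x21-\x7e spelled out instead. The helper name splitAddress is hypothetical, and the substitution of the range for [:graph:] is an assumption of this port rather than part of the answer above.

```javascript
// [:graph:] covers the visible ASCII characters 0x21-0x7E. JavaScript has no
// POSIX classes, so the range is written out (an assumption of this port).
const ADDRESS = /^([\x21-\x7e]{1,64})@([\x21-\x7e]{1,255})$/;

// Hypothetical helper: returns [localPart, domainPart], or null if no match.
function splitAddress(input) {
  const m = ADDRESS.exec(input);
  return m ? [m[1], m[2]] : null;
}

// Local-part "user", domain "example.com".
console.log(splitAddress("user@example.com"));
// null: a space is not a visible ASCII character.
console.log(splitAddress("two words@example.com"));
// null: the local-part is 65 characters long.
console.log(splitAddress("a".repeat(65) + "@example.com"));
```

Because the range is ASCII-only, any non-ASCII character also fails the match, which sidesteps the UTF-8 concern for this particular pattern.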

Now, according to the RFCs, the following characters are not allowed in some circumstances, defined as "special" characters:

  < > ( ) [ ] : ; @ \ , . "

Further, the special characters are allowed in the local-part when quoted, except for the quote character itself, and in the domain part when bracketed. So for full compliance with more opportunities to filter out bad ones, we can still do:

   ^(?:("(?:[^\\"]|\\.){1,62}")|([[:graph:]]{1,64}))@(?:(\[[^]\\[]{1,253}\])|([[:graph:]]{1,255}))$

We can probably do better than this, but it becomes ugly (see the solution way below), and usually at the price of compliance.

After matching the above, we need to further check our captures:

  1. There are four distinct captures. The first ($1) is the quoted-text of the local-part (if it exists); the second ($2) is the dot-atom-text of the local-part (if no quoted text provided); the third ($3) is the domain-literal (if bracketed) and the fourth ($4) is the dot-atom-text of the domain-part (if not bracketed). Breaking it down this way allows us to more easily check the parts, because there are different rules for each.
  2. Only one of the first two parts will have a value in it. You can concatenate these two to get the local-part.
  3. Only one of the last two parts will have a value in it. Concatenate to get the domain-part.
  4. The quoted part may contain only characters of the [:graph:] character class, but that was too tricky to implement correctly here. (Advanced double-extended Perl regexes can combine character classes, but that's unlikely to be supported by other languages.)
  5. The bracketed part (domain) must also only comprise characters of the [:graph:] character class. The ugliness here is to ensure it does not contain a backslash or embedded bracket character.
  6. Both the dot-atom-text parts (local and domain) may not contain special characters, except for the . (period) -- see below. Check separately for that. However, see 10.
  7. While both the dot-atom-text parts (local and domain) may contain the period ., neither may begin nor end with it, and there may not be two consecutive periods. This must be checked separately (or the length is checked separately, one or the other).
  8. The total length of the local-part may not exceed 64 characters. This is noted in RFC5321, and is a requirement of the SMTP header. That's what the {1,64} quantifiers in the expressions above are for.
  9. The domain part must not exceed 255 characters. See above.
  10. Concerning the domain part, DO NOT try to further validate it. Let the host's resolver do that for you. For instance, DNS rejects hostnames with _ in it, but it's perfectly fine as a local-host definition in both Unix and Microsoft, and by extension, Docker.

So now you need to check the parts correctly:

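# Captures that did not participate are undef; default them to '' so that
# concatenation is safe.
$quoted_text     = $1 // '';
$local_dot_atom  = $2 // '';
$domain_literal  = $3 // '';
$domain_dot_atom = $4 // '';
# The quoted and bracketed parts may contain only [:graph:] characters.
return 0 if ($quoted_text . $domain_literal) =~ /[^[:graph:]]/;
# The dot-atom parts may not contain special characters.
return 0 if ($domain_dot_atom . $local_dot_atom) =~ /[]<>():;\@\\,"[]/;
# ^- note, the period is omitted from special class, and checked below.
return 0 if $local_dot_atom  =~ /(^\.)|(\.\.)|(\.$)/;
return 0 if $domain_dot_atom =~ /(^\.)|(\.\.)|(\.$)/;

# Only one capture of each pair is non-empty, so concatenating yields
# (local-part, domain-part).
return ( $quoted_text . $local_dot_atom, $domain_literal . $domain_dot_atom );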
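For readers working in JavaScript, as in other answers in this thread, the four-capture pattern and the follow-up checks above could be sketched roughly as follows. This is not the author's code: the name parseAddress is hypothetical, and [\x21-\x7e] is assumed as a stand-in for [[:graph:]], which JavaScript does not support.

```javascript
// Four captures: quoted local-part, dot-atom local-part, bracketed domain
// literal, dot-atom domain. [\x21-\x7e] stands in for [[:graph:]].
const ADDRESS =
  /^(?:("(?:[^\\"]|\\.){1,62}")|([\x21-\x7e]{1,64}))@(?:(\[[^\]\\[]{1,253}\])|([\x21-\x7e]{1,255}))$/;

const NON_GRAPH = /[^\x21-\x7e]/;
const SPECIALS = /[\]<>():;@\\,"[]/; // the period is checked separately
const BAD_DOTS = /^\.|\.\.|\.$/;     // leading, doubled or trailing period

// Hypothetical helper: returns { local, domain }, or null when a check fails.
function parseAddress(input) {
  const m = ADDRESS.exec(input);
  if (!m) return null;
  // Captures that did not participate are undefined; default them to "".
  const [quoted = "", localAtom = "", literal = "", domainAtom = ""] = m.slice(1);
  if (NON_GRAPH.test(quoted + literal)) return null;
  if (SPECIALS.test(domainAtom + localAtom)) return null;
  if (BAD_DOTS.test(localAtom) || BAD_DOTS.test(domainAtom)) return null;
  return { local: quoted + localAtom, domain: literal + domainAtom };
}

console.log(parseAddress("user@example.com"));    // local "user", domain "example.com"
console.log(parseAddress("user@[192.168.0.1]"));  // domain is the bracketed literal
console.log(parseAddress("a..b@example.com"));    // null: two consecutive periods
console.log(parseAddress("a@b@example.com"));     // null: unquoted "@" in local-part
```

As in the Perl version, special characters are only tolerated inside the quoted or bracketed captures, so "a@b"@example.com (with the quotes) passes while the unquoted form does not.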

Now, if you really want a single regex that does nearly all the above, Xavier Spriet wrote a fantastic article on this topic and provided a one-step regex that checks for most of this. He explains it in great detail in that article. In case it goes away, here's the final version:

(?:[a-zA-Z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-zA-Z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:[a-zA-Z0-9](?:[a-zA-Z0-9-]*[a-zA-Z0-9])?\.)+[a-zA-Z0-9](?:[a-zA-Z0-9-]*[a-zA-Z0-9])?

The above is great, but I wouldn't use it:

  1. It doesn't check the length limitation. It's more important to test that on your input string, to ensure there's no weird buffer-overflow attack going on. Instead, you should first check that the entire string is no longer than 64 + 1 + 255 = 320 characters before using the expression, then check each part.
  2. He's far too restrictive on the domain expression. This is easily fixed, but then I think the expression is horribly ugly. (Disallows bracketed IPv6 Addresses, as well as other characters permitted in general but not allowed by DNS (such as the underscore)).
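The length pre-check from point 1 might look something like the sketch below. The function name withinLengthLimits is hypothetical, and splitting at the last "@" is an assumption made for the sake of a short example.

```javascript
const MAX_LOCAL = 64;
const MAX_DOMAIN = 255;

// Hypothetical pre-check, intended to run before any regex is applied.
function withinLengthLimits(input) {
  // Overall cap: 64 + 1 + 255 = 320 characters.
  if (input.length > MAX_LOCAL + 1 + MAX_DOMAIN) return false;
  // Assumption: the last "@" separates the local-part from the domain.
  const at = input.lastIndexOf("@");
  if (at < 0) return false;
  const localLength = at;
  const domainLength = input.length - at - 1;
  return (
    localLength >= 1 && localLength <= MAX_LOCAL &&
    domainLength >= 1 && domainLength <= MAX_DOMAIN
  );
}

console.log(withinLengthLimits("user@example.com"));               // true
console.log(withinLengthLimits("a".repeat(65) + "@example.com"));  // false
console.log(withinLengthLimits("user@" + "d".repeat(256)));        // false
```

Note that String.prototype.length counts UTF-16 code units, which equals the character count only for ASCII input.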

Upvotes: 2

sushil khati

Reputation: 11

You can simply use regex as below:

[^\s]{0,}@[^\s]{0,}

Upvotes: -1

Ambrish Pathak

Reputation: 3968

You can create a function with regex /([a-zA-Z0-9._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9_-]+)/ to extract email ids from long text

function extractEmails (text) {
  return text.match(/([a-zA-Z0-9._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9_-]+)/gi);
}

Script in action: Run to see result

var text = `boleh di kirim ke email saya [email protected] tks... boleh minta kirim ke [email protected]. [email protected]. . 
[email protected] Senior Quantity Surveyor
[email protected], terimakasih bu Cindy Hartanto
[email protected] saya mau dong bu cindy
[email protected] 
Hi Cindy ...pls share the Salary guide to [email protected] thank a`; 

function extractEmails ( text ){
    return text.match(/([a-zA-Z0-9._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9_-]+)/gi);
    }
     
    $("#emails").text(extractEmails(text));
<script src="https://ajax.googleapis.com/ajax/libs/jquery/1.9.1/jquery.min.js"></script>
<p id="emails"></p>
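For a quick check outside the browser, the same pattern can be run under Node without jQuery. The sketch below uses made-up addresses in sentences shaped like the ones in the question, and adds an `|| []` fallback (not in the snippet above) because match() returns null when nothing is found.

```javascript
// Same pattern as above; the fallback to [] is an addition for the no-match case.
function extractEmails(text) {
  return text.match(/([a-zA-Z0-9._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9_-]+)/gi) || [];
}

// Hypothetical sample text; the addresses are invented for illustration.
const sample = `boleh minta kirim ke budi@example.co.id.
siti_99@some-url.com, terimakasih
pls share the Salary guide to j.doe@mail.example.org thank a`;

console.log(extractEmails(sample).join(", "));
// budi@example.co.id, siti_99@some-url.com, j.doe@mail.example.org
```

The trailing full stop and comma are left out because the pattern requires at least one non-dot character after the final period.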

-----Update-----

The regex in the above code snippet matches most email patterns. If you need to match >99% of email patterns, including edge cases (like '+' in the address), use the regex pattern shown below.

Script in action: Run to see result

var text = `boleh di kirim ke email saya [email protected] tks... boleh minta kirim ke [email protected]. [email protected]. . 
[email protected] Senior Quantity Surveyor
[email protected], terimakasih bu Cindy Hartanto
[email protected] saya mau dong bu cindy
[email protected] 
Hi Cindy ...pls share the Salary guide to [email protected] thank a`; 

function extractEmails ( text ){
    return text.match(/(?:[a-z0-9+!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])/gi);
    }
     
    $("#emails").text(extractEmails(text));
<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/2.2.2/jquery.min.js"></script>
<p id="emails"></p>

Upvotes: 109

Shivam Baldha

Reputation: 56

This works very well for me in Python. Try it yourself.

[a-z]+@[a-z]+\.[a-z]+

Upvotes: 0

Sanjeev Siva

Reputation: 984

I would like to add to @Ambrish Pathak's answer,

According to Wikipedia, an email address can also contain a + sign, so

([a-zA-Z0-9+._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9_-]+)

will work like a charm
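A small sketch of the difference, using an invented plus-tagged address: without + in the character class, the match starts after the plus sign and silently drops the first half of the local-part.

```javascript
const withoutPlus = /([a-zA-Z0-9._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9_-]+)/g;
const withPlus = /([a-zA-Z0-9+._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9_-]+)/g;

// Hypothetical input containing a plus-tagged address.
const text = "kirim ke first.last+news@example.com ya";

const truncated = text.match(withoutPlus); // only "news@example.com" is found
const complete = text.match(withPlus);     // "first.last+news@example.com" is found

console.log(truncated.join(", "));
console.log(complete.join(", "));
```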

Upvotes: 37

Vamshidhar H.K.

Reputation: 306

\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}+\.[A-Z]{2,}

Upvotes: 0

m87

Reputation: 4523

You can use the following regex to capture all the email addresses.

(?<name>[\w.]+)\@(?<domain>\w+\.\w+)(\.\w+)?

see demo / explanation

Additionally, if you want, you can capture only those emails that contain a specific domain name (e.g. some-url.com). To achieve that, replace the \w+\.\w+ part after <domain> with your desired domain name, so it would look like (?<name>[\w.]+)\@(?<domain>outlook\.com)(\.\w+)?

see demo / explanation
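As a rough illustration of how the named groups can be read in JavaScript (ES2018 or later), here is a sketch using matchAll. The helper name findAddresses and the sample addresses are hypothetical.

```javascript
// Same pattern as above with the g flag; "\@" and "@" are equivalent here.
const pattern = /(?<name>[\w.]+)@(?<domain>\w+\.\w+)(\.\w+)?/g;

// Hypothetical helper: collects the full match plus the two named groups.
function findAddresses(text) {
  return [...text.matchAll(pattern)].map((m) => ({
    full: m[0],
    name: m.groups.name,
    domain: m.groups.domain,
  }));
}

const text = "kirim ke anna.k@outlook.com dan budi@example.co.id.";
for (const hit of findAddresses(text)) {
  console.log(hit.full, "->", hit.name, "|", hit.domain);
}
// anna.k@outlook.com -> anna.k | outlook.com
// budi@example.co.id -> budi | example.co
```

Two things are worth noting: the domain group holds only the first two labels (a third label such as .id lands in the unnamed third group), and \w does not match a hyphen, so an address at some-url.com is not matched by this pattern at all.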

Upvotes: 2

Mickaël Derriey

Reputation: 13704

[a-zA-Z0-9-_.]+@[a-zA-Z0-9-_.]+ worked for me, you can check the result on this regex101 saved regex.

It's really just twice the same pattern separated by an @ sign.

The pattern is 1 or more occurrences of:

  • a-z: any lowercase letter
  • A-Z: any uppercase letter
  • 0-9: any digit
  • -_.: a hyphen, an underscore or a dot

If it missed some emails, add any missing character to it and it should do the trick.
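One caveat when running this pattern on ordinary sentences: since the dot is part of the character class, a full stop directly after an address becomes part of the match. The sketch below shows that with an invented address and trims the result afterwards. The helper extractAndTrim and the trimming step are assumptions added for illustration, not part of the pattern above.

```javascript
const pattern = /[a-zA-Z0-9-_.]+@[a-zA-Z0-9-_.]+/g;

// Hypothetical helper: match, then drop trailing dots, hyphens or underscores
// that were picked up from sentence punctuation.
function extractAndTrim(text) {
  return (text.match(pattern) || []).map((hit) => hit.replace(/[-_.]+$/, ""));
}

// Hypothetical input; the address is invented.
const text = "boleh minta kirim ke budi@example.co.id. tks...";

console.log(text.match(pattern).join(", "));   // budi@example.co.id.
console.log(extractAndTrim(text).join(", "));  // budi@example.co.id
```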

Edit

I didn't notice it at first, but when going to the regex101 link, there's an Explanation section at the top-right corner of the screen explaining what the regular expression matches.

Upvotes: 8
