user3089455

Reputation: 21

How to validate real, practical, commonly accepted email addresses (but including unicode)?

The topic of validating email addresses has come up a lot on SO, and the discussion often gets sidetracked by the fact that email addresses are theoretically allowed to contain a lot of "special" characters. Some say many email validators are too strict, but if Gmail, Facebook, Yahoo, etc. are very strict, then the overwhelming majority of addresses out there conform to standards that are stricter than what the RFC allows, and that's the REAL world.

I've done a survey (14-01-19) of free email services and here are some of the most common restrictions on email names:

GMail: NOT allowed: ! " # $ % & ( ) * + , / : ; < = > ? @ [ \ ] ^ ` { | } ~

Yahoo.com: Only letters, numbers, underscores (_), and ONE dot (.) are allowed.

Zoho.com: Only letters, numbers, underscores (_), and dots (.) are allowed.

facebook.com: Only letters, numbers, underscores (_), dots (.), and hyphens (-) are allowed.

hushmail.com: Only letters, numbers, underscores (_), dots (.), and hyphens (-) are allowed.

AIM (AOL): Characters NOT allowed: @, !, * or $ (many others too, but not specified)

Hotmail/Outlook.com: Letters, numbers, _-. OK; no accented or non-Latin alphabet

iCloud.com: Typical Apple, you have to download a bunch of crap (67.5 MBytes) and let it invade your system before you can even create an account. I didn't bother.

The bottom line is that the vast majority of email services only allow letters, numbers, underscores(_), dots (.), and hyphen (-). Also, I know from reading about this on many sites that quite a few people use '+' in their email address.

So, I'd like a nice simple filter to screen out any emails that are invalid because of overall structure or because they use anything other than the simple characters used by the vast majority of people and/or accepted by most email services: A-Z a-z 0-9 _.+-

Unfortunately PHP's filter_var function with FILTER_VALIDATE_EMAIL/FILTER_SANITIZE_EMAIL allows these characters to get through: !#$%&'*/=?^`{|}~@[] and so I consider it virtually useless - especially since it allows the quote (') symbol. Only the most geeky nerds will use any of those symbols in their email address and if they do, they'll be rejected by the vast majority of sites, so they must have a more normal backup email address anyway.

One complication: I live in Vietnam and must allow for the possibility of unicode characters in the addresses. How can I do that?

Upvotes: 0

Views: 406

Answers (2)

Mehdi

Reputation: 4318

There are several ways for an email address to be invalid, and it's not just a matter of which characters it uses. Some format rules don't fit neatly into a single regular expression. For instance, consecutive dots (..) in an email address are invalid, and that check is easier to do outside the regex.

You can take a look at Zend_Validate_EmailAddress. If you check the source code, you will get a sense of the complexity of the problem.

// Split email address up and disallow '..'
if ((strpos($value, '..') !== false) or
    (!preg_match('/^(.+)@([^@]+)$/', $value, $matches))) {

    ...
}

$this->_localPart = $matches[1];
$this->_hostname  = $matches[2];

...

$hostname = $this->_validateHostnamePart();
...
$local = $this->_validateLocalPart();

And in _validateLocalPart they do this:

// atext: ALPHA / DIGIT / and "!", "#", "$", "%", "&", "'", "*",
//        "+", "-", "/", "=", "?", "^", "_", "`", "{", "|", "}", "~"
$atext = 'a-zA-Z0-9\x21\x23\x24\x25\x26\x27\x2a\x2b\x2d\x2f\x3d\x3f\x5e\x5f\x60\x7b\x7c\x7d\x7e';
if (preg_match('/^[' . $atext . ']+(\x2e+[' . $atext . ']+)*$/', $this->_localPart)) {
    ...
}
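
If you would rather enforce the narrower character set from the question than the full RFC atext, the same split-then-check structure can be adapted. The following is only a minimal sketch, not Zend's code: isPracticalEmail is a hypothetical helper name, the \p{L}, \p{M}, and \p{N} classes with the /u modifier are my assumption for how to admit Unicode letters, combining marks (needed for decomposed Vietnamese text), and digits, and the hostname rule (at least two dot-separated labels) is a deliberate simplification.

```php
<?php
// Hypothetical helper (not part of Zend): accepts only letters, digits,
// and _ . + - in the local part, including Unicode letters and digits.
function isPracticalEmail(string $value): bool
{
    // Reject consecutive dots up front, as the Zend code above does.
    if (strpos($value, '..') !== false) {
        return false;
    }

    // Split into local part and hostname at the last '@'.
    // With /u, preg_match returns false on invalid UTF-8, which also fails here.
    if (!preg_match('/^(.+)@([^@]+)\z/u', $value, $matches)) {
        return false;
    }
    $local = $matches[1];
    $host  = $matches[2];

    // Local part: runs of allowed characters separated by single dots.
    $atom = '[\p{L}\p{M}\p{N}_+\-]+';
    if (!preg_match('/^' . $atom . '(?:\.' . $atom . ')*\z/u', $local)) {
        return false;
    }

    // Hostname (simplified assumption): two or more labels made of
    // letters, digits, and inner hyphens, separated by dots.
    $label = '[\p{L}\p{N}](?:[\p{L}\p{M}\p{N}\-]*[\p{L}\p{N}])?';
    return preg_match('/^' . $label . '(?:\.' . $label . ')+\z/u', $host) === 1;
}
```

With this sketch, isPracticalEmail('john.doe+tag@example.com') passes, while addresses containing a quote, a leading dot, or a doubled dot are rejected. It only screens characters and overall shape; it says nothing about whether the mailbox exists.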

Upvotes: 0

Pekka

Reputation: 449485

The only correct way to validate an E-Mail address is to send an E-Mail with a confirmation link to it.

If you feel so inclined, check for a general (string)@(string).(string) pattern to catch user mistakes and obvious bogus entries like lalalalala.

The filter_var function (that you already mention) does that.
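
As a rough sketch of that sanity check plus the token for a confirmation link, assuming PHP 7.1 or later (where filter_var accepts FILTER_FLAG_EMAIL_UNICODE to allow Unicode in the local part). looksLikeEmail and makeConfirmationToken are hypothetical helper names; storing the token and actually sending the mail depend on your setup and are left out.

```php
<?php
// Hypothetical helper: a cheap shape check to catch typos and obvious junk.
function looksLikeEmail(string $value): bool
{
    // filter_var returns the filtered string on success and false on failure.
    return filter_var($value, FILTER_VALIDATE_EMAIL, FILTER_FLAG_EMAIL_UNICODE) !== false;
}

// Hypothetical helper: an unguessable token to embed in the confirmation link.
function makeConfirmationToken(): string
{
    // 16 random bytes, hex-encoded to 32 characters.
    return bin2hex(random_bytes(16));
}
```

The real validation is still the user clicking the link; the shape check only spares you from mailing strings that cannot be addresses.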

Anything beyond that is a waste of time.

Upvotes: 4
