Reputation: 364
In this thread I would like to collect best practices and solutions for validating an email address, including international email addresses. There are several approaches, such as structural checks and DNS lookups, but there seem to be traps and edge cases along the way that not everybody knows about. I hope you can help me collect good links, code, and tips, grouped by topic (e.g. server side, HTML preparation, ...).
Let's handle each area of interest in a separate answer.
When I use the term validation, I mean data validation. Wikipedia defines it as follows:
[...] is the process of ensuring data have undergone data cleansing to ensure they have data quality, that is, that they are both correct and useful.
Source: https://en.wikipedia.org/wiki/Data_validation
Email address validation means testing whether a string is valid under the terms of RFC 5322, the current version of the specification that describes the Internet Message Format used by email. Reference: https://www.rfc-editor.org/rfc/rfc5322
This does not include checking whether the email provider is acceptable (e.g. disposable addresses), whether the address makes sense (e.g. [email protected]), or whether the TLD is available.
An international email address (ref) can contain all kinds of UTF-8 encoded characters that do not exist in ASCII.
Valid Examples based on the wiki article:
Upvotes: 3
Views: 1284
Reputation: 364
Not a duplicate: This answer collects known solutions for validating an email address. It also contains information about known limitations when checking international email addresses. At the end, I provide a possible solution for handling international addresses.
The author of this post proposed the following function to validate an email address:
function isValidEmail($email) {
    return filter_var($email, FILTER_VALIDATE_EMAIL) !== false;
}
If you require a TLD to be part of the address, the author also proposed:
function isValidEmail($email) {
    return filter_var($email, FILTER_VALIDATE_EMAIL) !== false
        && preg_match('/@.+\./', $email) === 1;
}
By default, filter_var does not cover international email addresses, i.e. addresses that contain non-ASCII characters such as Greek or Cyrillic letters.
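To see the limitation in practice, you can run both kinds of address through the filter. This is only a small sketch: the sample addresses and the helper name passes_filter are made up for this example. FILTER_FLAG_EMAIL_UNICODE (available since PHP 7.1) relaxes the check for the local part only, so I would not treat it as full support for international addresses.

```php
<?php
// Hypothetical helper for this example: true if filter_var accepts the address.
function passes_filter(string $email, int $flags = 0): bool {
    return filter_var($email, FILTER_VALIDATE_EMAIL, $flags) !== false;
}

$ascii = 'user@example.com';
$greek = 'δοκιμή@example.com'; // non-ASCII local part, ASCII domain

var_dump(passes_filter($ascii)); // plain ASCII address
var_dump(passes_filter($greek)); // the default filter rejects the Greek local part
var_dump(passes_filter($greek, FILTER_FLAG_EMAIL_UNICODE)); // Unicode permitted in the local part
```

The flag does not convert or check internationalized domain names, so the domain part still needs separate handling.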
Use a custom regex to validate the structure. A good post with a detailed description is here.
The author proposed a regex from http://emailregex.com/ that checks an address against RFC 5322. The following code is the non-fixed version:
$regex = '/^(?!(?:(?:\x22?\x5C[\x00-\x7E]\x22?)|(?:\x22?[^\x5C\x22]\x22?)){255,})(?!(?:(?:\x22?\x5C[\x00-\x7E]\x22?)|(?:\x22?[^\x5C\x22]\x22?)){65,}@)(?:(?:[\x21\x23-\x27\x2A\x2B\x2D\x2F-\x39\x3D\x3F\x5E-\x7E]+)|(?:\x22(?:[\x01-\x08\x0B\x0C\x0E-\x1F\x21\x23-\x5B\x5D-\x7F]|(?:\x5C[\x00-\x7F]))*\x22))(?:\.(?:(?:[\x21\x23-\x27\x2A\x2B\x2D\x2F-\x39\x3D\x3F\x5E-\x7E]+)|(?:\x22(?:[\x01-\x08\x0B\x0C\x0E-\x1F\x21\x23-\x5B\x5D-\x7F]|(?:\x5C[\x00-\x7F]))*\x22)))*@(?:(?:(?!.*[^.]{64,})(?:(?:(?:xn--)?[a-z0-9]+(?:-[a-z0-9]+)*\.){1,126}){1,}(?:(?:[a-z][a-z0-9]*)|(?:(?:xn--)[a-z0-9]+))(?:-[a-z0-9]+)*)|(?:\[(?:(?:IPv6:(?:(?:[a-f0-9]{1,4}(?::[a-f0-9]{1,4}){7})|(?:(?!(?:.*[a-f0-9][:\]]){7,})(?:[a-f0-9]{1,4}(?::[a-f0-9]{1,4}){0,5})?::(?:[a-f0-9]{1,4}(?::[a-f0-9]{1,4}){0,5})?)))|(?:(?:IPv6:(?:(?:[a-f0-9]{1,4}(?::[a-f0-9]{1,4}){5}:)|(?:(?!(?:.*[a-f0-9]:){5,})(?:[a-f0-9]{1,4}(?::[a-f0-9]{1,4}){0,3})?::(?:[a-f0-9]{1,4}(?::[a-f0-9]{1,4}){0,3}:)?)))?(?:(?:25[0-5])|(?:2[0-4][0-9])|(?:1[0-9]{2})|(?:[1-9]?[0-9]))(?:\.(?:(?:25[0-5])|(?:2[0-4][0-9])|(?:1[0-9]{2})|(?:[1-9]?[0-9]))){3}))\]))$/iD';
if (\preg_match($regex, $email) === 1) {
    // email OK
}
He also mentioned:
[...] RFC 5322 leads to a regex that can be understood if studied for a few minutes and is efficient enough for actual use. [...]
This solution does not cover international addresses either; they simply do not match.
A DNS lookup is not validation in the sense defined above, but it can complement the structural check. It also works for domains with non-ASCII characters, as long as they form a valid internationalized domain name (Reference: https://en.wikipedia.org/wiki/Internationalized_domain_name).
[...] is an Internet domain name that contains at least one label that is displayed in software applications, [...], in a language-specific script or alphabet, such as Arabic, Chinese, Cyrillic, Tamil, Hebrew or the Latin alphabet-based characters with diacritics or ligatures, such as French.
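For the DNS lookup, an internationalized domain has to be converted to its ASCII (Punycode) form first. A minimal sketch, assuming the intl extension is installed; the helper name domain_to_ascii and the sample domains are made up for this example:

```php
<?php
// Hypothetical helper: convert a (possibly internationalized) domain to its
// ASCII form. Returns null if ext-intl is missing or the conversion fails.
function domain_to_ascii(string $domain): ?string {
    if (!function_exists('idn_to_ascii')) {
        return null; // assumption: callers treat this as "cannot check"
    }
    $ascii = idn_to_ascii($domain, IDNA_DEFAULT, INTL_IDNA_VARIANT_UTS46);
    return $ascii === false ? null : $ascii;
}

var_dump(domain_to_ascii('example.com')); // ASCII domains pass through unchanged
var_dump(domain_to_ascii('müller.de'));   // with ext-intl: "xn--mller-kva.de"
```

The returned ASCII name is what you would hand to a DNS function.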
With checkdnsrr you can check whether a given domain has a DNS record of a given type.
// $domain was extracted from the given email before
// $domain must end with a . (see comment below)
if (checkdnsrr($domain, 'MX') || checkdnsrr($domain, 'A') || checkdnsrr($domain, 'AAAA')) {
    // domain is VALID
}
User Martin mentioned at php.net that the domain must end with a . (trailing dot) to be considered valid. Without the trailing dot, you can get false positives, possibly because the resolver may append a local search domain to the name.
Source: http://php.net/manual/en/function.checkdnsrr.php#119969
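The snippet above assumes that $domain has already been extracted and ends with a dot. A minimal sketch of that preparation step follows; the helper name domain_for_lookup is made up for this example, and splitting at the last @ is my assumption about how to extract the domain.

```php
<?php
// Hypothetical helper: take the part after the last "@" and turn it into a
// name with a trailing dot, ready to be passed to checkdnsrr().
function domain_for_lookup(string $email): ?string {
    $at = strrpos($email, '@');
    if ($at === false || $at === 0) {
        return null; // no "@" at all, or nothing in front of it
    }
    // Strip any existing trailing dots so exactly one is appended below.
    $domain = rtrim(substr($email, $at + 1), '.');
    return $domain === '' ? null : $domain . '.';
}

var_dump(domain_for_lookup('user@example.com')); // "example.com."
var_dump(domain_for_lookup('no-at-sign'));       // NULL
```

An internationalized domain would still need to be converted to ASCII before the lookup.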
From what I have seen so far, you need a combination of structural checks and a DNS lookup to get the best coverage. The first part of the following code is based on the class EmailAddress from Genkgo Mail ( source ).
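function mail_is_valid(string $address): bool {
    // Structural check: exactly one "@" with a non-empty part on each side.
    // preg_match() returns false on error, so compare against 1.
    if (\preg_match('/^([^@]+)@([^@]+)$/', $address, $matches) !== 1) {
        // email NOT valid
        return false;
    }
    [$address, $localPart, $domain] = $matches;

    // Prefer UTS #46 where available; the IDNA 2003 constant is only
    // evaluated on older intl builds that lack it.
    $variant = \defined('INTL_IDNA_VARIANT_UTS46')
        ? INTL_IDNA_VARIANT_UTS46
        : INTL_IDNA_VARIANT_2003;

    // Convert an internationalized domain to its ASCII (Punycode) form.
    $ascii = \idn_to_ascii($domain, IDNA_DEFAULT, $variant);
    if ($ascii === false) {
        return false;
    }

    // Ensure exactly one trailing dot for the DNS lookup.
    $domain = \rtrim($ascii, '.') . '.';

    // Accept the domain if it has an MX, A or AAAA record.
    return \checkdnsrr($domain, 'MX')
        || \checkdnsrr($domain, 'A')
        || \checkdnsrr($domain, 'AAAA');
}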
I consider it the best solution currently available, because the algorithm is mostly character-agnostic, which allows non-ASCII (UTF-8 encoded) characters in the address. An address is accepted as long as it has a user part + @ + domain part, and the DNS lookup ensures that the domain exists.
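Because the structural part accepts any characters, it can be worth running a network-free pre-check before the DNS lookup. The following is only a sketch: the function name mail_structure_ok is made up, and the limits (64 octets for the local part, 255 for the domain, as stated in RFC 5321) are applied to the raw bytes, which is a simplification for internationalized domains.

```php
<?php
// Hypothetical pre-check: encoding, structure and length only, no DNS.
function mail_structure_ok(string $address): bool {
    // An empty pattern with the /u modifier only matches valid UTF-8 input.
    if (preg_match('//u', $address) !== 1) {
        return false;
    }
    // Exactly one "@" with a non-empty part on each side.
    if (preg_match('/^([^@]+)@([^@]+)$/', $address, $m) !== 1) {
        return false;
    }
    // Octet limits taken from RFC 5321 (assumption: checked on raw bytes).
    return strlen($m[1]) <= 64 && strlen($m[2]) <= 255;
}

var_dump(mail_structure_ok('δοκιμή@example.com'));  // valid UTF-8, one "@"
var_dump(mail_structure_ok("\xC3\x28@example.com")); // broken UTF-8 sequence
```

Addresses that fail here never need a DNS query, which saves lookups for obviously malformed input.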
This approach is not optimal. If you know a better way, please post it as a comment or an answer.
Upvotes: 1