k00ni

Reputation: 364

Validate international email addresses with PHP

Motivation

In this thread I would like to collect best practices and solutions for validating email addresses, including international ones. There are several approaches, such as structural checks and DNS lookups, but there seem to be traps and edge cases along the way that not everybody knows about. I hope you can help me collect good links, code and tips, grouped by topic (e.g. server side, HTML preparation, ...).

Let's handle each area of interest in a separate answer.

Meaning of validation

When I use the term validation, I mean data validation. Wikipedia defines it as follows:

[...] is the process of ensuring data have undergone data cleansing to ensure they have data quality, that is, that they are both correct and useful.

Source: https://en.wikipedia.org/wiki/Data_validation

Email address validation

Email address validation means testing whether a string is valid under the terms of RFC 5322, the latest version of the specification that describes the Internet Message Format used by email. Reference: https://www.rfc-editor.org/rfc/rfc5322

That does not include checking whether the email provider is acceptable (e.g. disposable email services), whether the address makes sense (e.g. [email protected]) or whether the TLD exists.

Not covered by most validators: International email addresses

An international email address (ref) can contain Unicode characters, encoded as UTF-8, that do not exist in ASCII.

Valid Examples based on the wiki article:

Upvotes: 3

Views: 1284

Answers (1)

k00ni

Reputation: 364

Not a duplicate: This answer collects known solutions for validating an email address. It also describes known limitations when checking international email addresses. At the end I propose a possible solution for handling them.

filter_var

The author of this post proposed the following function to validate an email address:

function isValidEmail($email){ 
    return filter_var($email, FILTER_VALIDATE_EMAIL) !== false;
}

If you require a TLD to be part of the address, the author also proposed:

function isValidEmail($email) {
    return filter_var($email, FILTER_VALIDATE_EMAIL) 
        && preg_match('/@.+\./', $email);
}

Problem: No support for international email addresses

By default, filter_var rejects international email addresses, i.e. addresses that contain non-ASCII characters such as Greek or Cyrillic letters.
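To make the limitation concrete, here is a minimal sketch. The two sample addresses are made up for illustration. PHP 7.1 added the flag FILTER_FLAG_EMAIL_UNICODE, which according to the PHP manual accepts Unicode characters in the local part only; it is shown as something to evaluate, not as a complete answer for international addresses.

```php
<?php
// Made-up sample addresses, used only for illustration.
$ascii   = 'user@example.com';
$unicode = 'δοκιμή@example.com';

// Default behaviour: the ASCII address passes, the Unicode one does not.
var_dump(filter_var($ascii, FILTER_VALIDATE_EMAIL) !== false);
var_dump(filter_var($unicode, FILTER_VALIDATE_EMAIL) !== false);

// With the flag (PHP >= 7.1) a Unicode local part should be accepted.
// The domain part is not covered by this flag.
var_dump(filter_var($unicode, FILTER_VALIDATE_EMAIL, FILTER_FLAG_EMAIL_UNICODE) !== false);
```

An internationalized domain would still need separate handling, for example by converting it to its ASCII form first.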


preg_match

Use a custom regex to validate the structure. A good post with a detailed description is here.

The author proposed a regex from http://emailregex.com/, which checks an address against RFC 5322. The following code is the non-fixed version:

$regex = '/^(?!(?:(?:\x22?\x5C[\x00-\x7E]\x22?)|(?:\x22?[^\x5C\x22]\x22?)){255,})(?!(?:(?:\x22?\x5C[\x00-\x7E]\x22?)|(?:\x22?[^\x5C\x22]\x22?)){65,}@)(?:(?:[\x21\x23-\x27\x2A\x2B\x2D\x2F-\x39\x3D\x3F\x5E-\x7E]+)|(?:\x22(?:[\x01-\x08\x0B\x0C\x0E-\x1F\x21\x23-\x5B\x5D-\x7F]|(?:\x5C[\x00-\x7F]))*\x22))(?:\.(?:(?:[\x21\x23-\x27\x2A\x2B\x2D\x2F-\x39\x3D\x3F\x5E-\x7E]+)|(?:\x22(?:[\x01-\x08\x0B\x0C\x0E-\x1F\x21\x23-\x5B\x5D-\x7F]|(?:\x5C[\x00-\x7F]))*\x22)))*@(?:(?:(?!.*[^.]{64,})(?:(?:(?:xn--)?[a-z0-9]+(?:-[a-z0-9]+)*\.){1,126}){1,}(?:(?:[a-z][a-z0-9]*)|(?:(?:xn--)[a-z0-9]+))(?:-[a-z0-9]+)*)|(?:\[(?:(?:IPv6:(?:(?:[a-f0-9]{1,4}(?::[a-f0-9]{1,4}){7})|(?:(?!(?:.*[a-f0-9][:\]]){7,})(?:[a-f0-9]{1,4}(?::[a-f0-9]{1,4}){0,5})?::(?:[a-f0-9]{1,4}(?::[a-f0-9]{1,4}){0,5})?)))|(?:(?:IPv6:(?:(?:[a-f0-9]{1,4}(?::[a-f0-9]{1,4}){5}:)|(?:(?!(?:.*[a-f0-9]:){5,})(?:[a-f0-9]{1,4}(?::[a-f0-9]{1,4}){0,3})?::(?:[a-f0-9]{1,4}(?::[a-f0-9]{1,4}){0,3}:)?)))?(?:(?:25[0-5])|(?:2[0-4][0-9])|(?:1[0-9]{2})|(?:[1-9]?[0-9]))(?:\.(?:(?:25[0-5])|(?:2[0-4][0-9])|(?:1[0-9]{2})|(?:[1-9]?[0-9]))){3}))\]))$/iD';

// preg_match returns 1 on a match, 0 on no match and false on error
if (\preg_match($regex, $email) === 1) {
   // email OK
}

He also mentioned:

[...] RFC 5322 leads to a regex that can be understood if studied for a few minutes and is efficient enough for actual use. [...]

Problem: No support for international email addresses

This solution does not cover international addresses either; they simply produce no match.


Optional: DNS lookup

A DNS lookup is not validation, but it can complement the check. It also works for domains that contain non-ASCII characters, as long as they form a valid internationalized domain name, typically after converting the domain to its ASCII (Punycode) form (Reference: https://en.wikipedia.org/wiki/Internationalized_domain_name).

[...] is an Internet domain name that contains at least one label that is displayed in software applications, [...], in a language-specific script or alphabet, such as Arabic, Chinese, Cyrillic, Tamil, Hebrew or the Latin alphabet-based characters with diacritics or ligatures, such as French.
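As a small illustration of how such a domain can be prepared for a lookup, the following sketch converts a made-up internationalized domain to its ASCII form with idn_to_ascii. It assumes the intl extension is installed, which is why the call is guarded.

```php
<?php
// Made-up example domain; this file must be saved as UTF-8.
$domain = 'bücher.example';

// idn_to_ascii() is provided by the intl extension, so check for it first.
if (function_exists('idn_to_ascii')) {
    $ascii = idn_to_ascii($domain, IDNA_DEFAULT, INTL_IDNA_VARIANT_UTS46);
    var_dump($ascii); // string "xn--bcher-kva.example", or false on failure
}
```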

With checkdnsrr you can check whether a given domain has a DNS record of a given type.

// $domain was extracted from the given email before
// $domain must end with a . (see comment below)

if (checkdnsrr($domain, 'MX') || checkdnsrr($domain, 'A') || checkdnsrr($domain, 'AAAA')) {
    // domain is VALID
}

User Martin mentioned at php.net that the domain must end with a . (trailing dot) to be considered valid. Without the trailing dot, you can get false positives.

Source: http://php.net/manual/en/function.checkdnsrr.php#119969
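The snippet above assumes that $domain was already extracted and ends with a dot. One way this preparation could look is sketched below; the helper name domain_for_dns_lookup is hypothetical and not part of PHP or of any library mentioned here.

```php
<?php
/**
 * Hypothetical helper: returns the domain part of $email with exactly one
 * trailing dot, or null if there is no usable domain part.
 */
function domain_for_dns_lookup(string $email): ?string
{
    // Use the last @, because the domain part itself cannot contain one.
    $pos = strrpos($email, '@');
    if ($pos === false) {
        return null;
    }

    // Strip any existing trailing dots so that exactly one is added back.
    $domain = rtrim((string) substr($email, $pos + 1), '.');
    if ($domain === '') {
        return null;
    }

    return $domain . '.';
}

var_dump(domain_for_dns_lookup('user@example.com')); // string(12) "example.com."
```

The result can then be passed to checkdnsrr as shown above.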


Handle international emails

Possible solution 1: structural check + DNS lookup

From what I have seen so far, you need a combination of structural checks and a DNS lookup to get the best coverage. The first part of the following code is based on the class EmailAddress from Genkgo Mail ( source ).

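function mail_is_valid(string $address): bool {
    // Structural check: exactly one @ with a non-empty part on each side.
    // preg_match returns false on error, so compare against 1 explicitly.
    if (\preg_match('/^([^@]+)@([^@]+)$/', $address, $matches) !== 1) {
        return false;
    }

    $domain = $matches[2];

    // Prefer UTS #46; the 2003 variant is only a fallback for old intl builds.
    $variant = \defined('INTL_IDNA_VARIANT_UTS46')
        ? INTL_IDNA_VARIANT_UTS46
        : INTL_IDNA_VARIANT_2003;

    // Convert an internationalized domain to its ASCII (Punycode) form.
    $asciiDomain = \idn_to_ascii($domain, IDNA_DEFAULT, $variant);
    if ($asciiDomain === false) {
        // domain could not be converted, email NOT valid
        return false;
    }

    // Make sure the domain ends with exactly one dot (see DNS section above).
    $fqdn = \rtrim($asciiDomain, '.') . '.';

    return \checkdnsrr($fqdn, 'MX')
        || \checkdnsrr($fqdn, 'A')
        || \checkdnsrr($fqdn, 'AAAA');
}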

I consider it the best solution currently available, because the algorithm is mostly character-agnostic, which allows UTF-8 characters in the email address. An address is accepted as long as it has a user part + @ + domain part, and the DNS lookup ensures the domain exists.

It's not optimal. If you know a better way, please post it as a comment or an answer.
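The character-agnostic part can be tried in isolation, without any network access. The following sketch applies the same pattern as mail_is_valid to a few made-up strings.

```php
<?php
// Same pattern as in mail_is_valid(); the input strings are made-up examples.
$pattern = '/^([^@]+)@([^@]+)$/';

var_dump(preg_match($pattern, 'δοκιμή@παράδειγμα.δοκιμή')); // int(1)
var_dump(preg_match($pattern, 'missing-at-sign'));           // int(0)
var_dump(preg_match($pattern, 'two@@signs'));                // int(0)
var_dump(preg_match($pattern, 'a@b@c'));                     // int(0)
```

Note that the pattern only counts @ characters. It also lets through strings such as 'a b@c d', so most of the actual filtering in this approach is done by the DNS lookup.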

Upvotes: 1
