Deepak Shrestha

Reputation: 763

How to validate non-English (UTF-8 encoded) email addresses in JavaScript and PHP?

Part of a website I am currently working on contains a registration process where users have to provide their email address. I recently became aware that non-ASCII domains are possible (and so are non-ASCII email addresses). My backend is a UTF-8 encoded MySQL database, and I expect that users with different locales should be able to enter their email address, but I don't know how to validate this kind of address.

Currently I am testing jQuery Tools; it validates English email addresses correctly but fails to validate non-ASCII ones. I also need to do the same on the server side with PHP. Is there a regular expression that can validate this kind of email address?

I have tried this address, but it fails in jQuery Tools (it is just a demo example; I don't understand it myself):

闪闪发光@闪闪发光.com

Also, what will happen when users type their English email address ([email protected]) with their own IME? Can it be validated with the regular expressions we currently have for English email validation? For now I don't have to worry about whether the email address actually exists or not.

Thanks

Upvotes: 12

Views: 14028

Answers (7)

Ilia Ross

Reputation: 13412

Following Mario's suggestion and playing around a bit, I came up with the following regex to validate non-standard email addresses:

^([\p{L}\_\.\-\d]+)@([\p{L}\-\.\d]+)((\.(\p{L}){2,63})+)$

It should validate email addresses containing any kind of Unicode letters, with the TLD limited to between 2 and 63 characters.

Please check it and let me know if there are any flaws.
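
For anyone who wants to try the regex above on the client side, here is a minimal JavaScript sketch. It assumes an engine that supports the "u" flag and Unicode property escapes, and the function name matchesUnicodeEmail is only illustrative. The one change to the pattern is that "\_" is written as "_", because "u" mode rejects that escape.

```javascript
// Regex from the answer above, adapted for JavaScript: the "u" flag enables \p{L},
// and "\_" becomes "_" because "u" mode does not allow escaping an underscore.
const unicodeEmail = /^([\p{L}_.\-\d]+)@([\p{L}\-.\d]+)((\.(\p{L}){2,63})+)$/u;

// Illustrative helper name, not part of any library.
function matchesUnicodeEmail(value) {
  return unicodeEmail.test(value);
}

console.log(matchesUnicodeEmail('闪闪发光@闪闪发光.com')); // true
console.log(matchesUnicodeEmail('user@example.com'));     // true
console.log(matchesUnicodeEmail('user+tag@example.com')); // false: "+" is not in the local-part class
```

The last line shows one flaw worth knowing about: the local part only accepts letters, digits, "_", "." and "-", so addresses using "+" are rejected.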


Upvotes: 2

powtac

Reputation: 41050

Since version 5.2, PHP has had built-in validation for email addresses, but I'm not sure whether it works for UTF-8 encoded strings:

echo filter_var($email, FILTER_VALIDATE_EMAIL);

In the original PHP source code you will find the regular expression used for validating email addresses; it can be used for manual validation when running PHP < 5.2.

Update

idn_to_ascii() can be used to "Convert domain name to IDNA ASCII form." The result can then be validated with filter_var($email, FILTER_VALIDATE_EMAIL);

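// International domains: convert the domain part to its ASCII (punycode) form
if (function_exists('idn_to_ascii') && strpos($email, '@') !== false) {
    $parts = explode('@', $email);
    $ascii = idn_to_ascii($parts[1]);
    // idn_to_ascii() returns false on failure; keep the original address in that case
    if ($ascii !== false) {
        $email = $parts[0].'@'.$ascii;
    }
}
$is_valid = filter_var($email, FILTER_VALIDATE_EMAIL);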
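
The same domain conversion can be sketched on the JavaScript side. This is only an illustration, assuming an environment with the standard URL class (modern browsers, Node.js); the function name emailDomainToAscii is made up for this example. It relies on the URL parser converting an internationalized host name to its punycode form.

```javascript
// Convert the domain part of an address to its ASCII (punycode) form,
// similar in spirit to the idn_to_ascii() step. Returns null if that is not possible.
// emailDomainToAscii is a hypothetical helper name.
function emailDomainToAscii(email) {
  const at = email.lastIndexOf('@');
  if (at < 1 || at === email.length - 1) return null; // need text on both sides of "@"
  const local = email.slice(0, at);
  const domain = email.slice(at + 1);
  // Reject characters the URL parser would treat as delimiters rather than host text.
  if (/[\s\/\\?#:\[\]]/.test(domain)) return null;
  try {
    // The URL parser applies IDNA processing to the host name.
    const asciiDomain = new URL('http://' + domain).hostname;
    return local + '@' + asciiDomain;
  } catch (e) {
    return null; // the domain could not be parsed as a host name
  }
}

console.log(emailDomainToAscii('user@español.com')); // user@xn--espaol-zwa.com
console.log(emailDomainToAscii('闪闪发光@闪闪发光.com'));
```

Only the domain is converted; the local part is left untouched, so an ASCII-only validator would still need to cope with a non-ASCII local part.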

Upvotes: 2

Synchro

Reputation: 37750

On this subject I liked this page so much that I set up a blog exposing sites that do validation wrong (contributions gratefully received - don't let yours be on it!).

As far as regexes go, those who say "it's wrong" tend to be light on alternatives, and TBH validation to the last letter of the RFC isn't really that critical. For example, while noddy+!#$%&'*-/=?+_{}|[email protected] is a perfectly valid address, it's not too unreasonable to reject it, given that a surprisingly large proportion of users can't even type 'hotmail' correctly. Some domains are also quite restrictive on user names anyway, particularly hotmail. So I'm in favour of regexes that are demonstrably reasonable, and my favourite source for that is this page, though I don't like their current JS 'winner' and it would help if they set up a public test page.

jQuery's validate plugin uses this regex which is interestingly constructed, quite similar in style (but smaller!) to the ex-parrot one (actually my ISP!) linked by @powtac .

Upvotes: -1

Deepak Shrestha

Reputation: 763

I got this idea from a JavaScript tutorial page. It is basic, but it works for me without having to worry about the complexity of regular expressions and Unicode standards.

Client side validation

if(!$.trim(value).length) {
    return false;
}
else {

    // Declare locally so the positions do not leak into the global scope
    var AtPos = value.indexOf("@");
    var StopPos = value.lastIndexOf(".");

    if (AtPos == -1 || StopPos == -1) {
        return false;
    }

    if (StopPos < AtPos) {
        return false;
    }

    if (StopPos - AtPos == 1) {
        return false;
    }

    return true;
}

Server side validation

if(!isset($_POST['emailaddr']) || trim($_POST['emailaddr']) == "") {
    //Error: Email required
}
else {
    $atpos = strpos($_POST['emailaddr'],'@');
    // Use the last dot, matching lastIndexOf(".") on the client side
    $stoppos = strrpos($_POST['emailaddr'],'.');

    if(($atpos === false) || ($stoppos === false)) {
        //Error: invalid email
    }
    else {
        if($stoppos < $atpos) {
            //Error: invalid email
        }
        else {
            if (($stoppos-$atpos) == 1) {
                //Error: invalid email
            }
        }
    }
}

Though it still has some loopholes, I guess users will not be fooling around with this stuff. Also, real validation is required for serious stuff, as suggested by 'Jeremy Banks'.

Hope this will be helpful for somebody else too.

Thanks and regards to all

Upvotes: 0

The Bndr

Reputation: 13394

What about something like this:

mb_internal_encoding("UTF-8");
mb_regex_encoding("UTF-8");
// mb_ereg() takes the pattern and the subject; the encoding is set by mb_regex_encoding() above
mb_ereg('[\w]+@[\w]+\.com', $mail);

Upvotes: -3

Attempting to validate email addresses may not be a good idea. The specifications (RFC5321, RFC5322) allow for so much flexibility that validating them with regular expressions is literally impossible, and validating with a function is a great deal of work. The result of this is that most email validation schemes end up rejecting a large number of valid email addresses, much to the inconvenience of the users. (By far the most common example of this is not allowing the + character.)

It is more likely that the user will (accidentally or deliberately) enter an incorrect email address than an invalid one, so actually validating is a great deal of work for very little benefit, with possible costs if you do it incorrectly.

I would recommend that you just check for the presence of an @ character on the client and then send a confirmation email to verify it; it's the most practical way to validate and it confirms that the address is correct as well.
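
As a rough illustration of that approach, the sketch below checks only for an "@" with something on each side and then prepares a confirmation link. It is a sketch under assumptions, not a complete flow: it targets Node.js, all function names are hypothetical, the base URL and the "token" query parameter are placeholders, and actually sending the email and storing the token are left out.

```javascript
const { randomBytes } = require('node:crypto');

// Slightly stricter than "contains @": requires at least one character on each side.
// hasPlausibleShape is a hypothetical helper name.
function hasPlausibleShape(email) {
  const at = email.lastIndexOf('@');
  return at > 0 && at < email.length - 1;
}

// Hypothetical helper: a random one-time token to embed in the confirmation link.
function createConfirmationToken() {
  return randomBytes(32).toString('hex'); // 32 random bytes as 64 hex characters
}

// Hypothetical helper: build the link that would go into the confirmation email.
function buildConfirmationLink(baseUrl, token) {
  const url = new URL(baseUrl);
  url.searchParams.set('token', token);
  return url.toString();
}

const email = '闪闪发光@闪闪发光.com';
if (hasPlausibleShape(email)) {
  const token = createConfirmationToken();
  console.log(buildConfirmationLink('https://example.test/confirm', token));
}
```

The address is only treated as verified once the user opens the link, which covers both mistyped and non-existent addresses without any detailed syntax rules.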

Upvotes: 15

powtac

Reputation: 41050

A regular expression could be something like this:

[^ ]+@[^ ]+\.[^ ]{2,6}
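
A small JavaScript sketch of how this pattern behaves, with and without anchors. The anchored variant is an addition for comparison and is not part of the answer; the variable names are illustrative.

```javascript
// The pattern as given: it can match anywhere inside the string.
const loose = /[^ ]+@[^ ]+\.[^ ]{2,6}/;
// Assumed variant with ^ and $ so that the whole string must match.
const anchored = /^[^ ]+@[^ ]+\.[^ ]{2,6}$/;

console.log(loose.test('闪闪发光@闪闪发光.com'));          // true
console.log(anchored.test('闪闪发光@闪闪发光.com'));       // true
console.log(loose.test('user@example.photography'));    // true: a partial match is enough
console.log(anchored.test('user@example.photography')); // false: the last label exceeds 6 characters
```

Without anchors the {2,6} limit has little effect, and with anchors it rejects longer top-level domains, so the upper bound may need raising either way.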

Upvotes: 0
