Reputation: 31780
After looking for a good email validation routine, I found this answer to a similar question and decided it was the most promising candidate. I implemented the following class for email validation (the RegexMatch class it inherits from validates a string against the regular expression supplied in the 'needle' key of an associative configuration array):
class Email extends RegexMatch implements iface\Prop
{
const
/**
* Regular expression for validating email addresses
*
* This regex is meant to validate against RFC 5322 and was taken from
* a post on Stack Overflow regarding email validation (see the links)
*
* @link http://www.ietf.org/rfc/rfc5322.txt, https://stackoverflow.com/questions/201323/what-is-the-best-regular-expression-for-validating-email-addresses/1917982#1917982
*/
PATTERN = '
/(?(DEFINE)
(?<address> (?&mailbox) | (?&group))
(?<mailbox> (?&name_addr) | (?&addr_spec))
(?<name_addr> (?&display_name)? (?&angle_addr))
(?<angle_addr> (?&CFWS)? < (?&addr_spec) > (?&CFWS)?)
(?<group> (?&display_name) : (?:(?&mailbox_list) | (?&CFWS))? ;
(?&CFWS)?)
(?<display_name> (?&phrase))
(?<mailbox_list> (?&mailbox) (?: , (?&mailbox))*)
(?<addr_spec> (?&local_part) \@ (?&domain))
(?<local_part> (?&dot_atom) | (?&quoted_string))
(?<domain> (?&dot_atom) | (?&domain_literal))
(?<domain_literal> (?&CFWS)? \[ (?: (?&FWS)? (?&dcontent))* (?&FWS)?
\] (?&CFWS)?)
(?<dcontent> (?&dtext) | (?&quoted_pair))
(?<dtext> (?&NO_WS_CTL) | [\x21-\x5a\x5e-\x7e])
(?<atext> (?&ALPHA) | (?&DIGIT) | [!#\$%&\'*+-\/=?^_`{|}~])
(?<atom> (?&CFWS)? (?&atext)+ (?&CFWS)?)
(?<dot_atom> (?&CFWS)? (?&dot_atom_text) (?&CFWS)?)
(?<dot_atom_text> (?&atext)+ (?: \. (?&atext)+)*)
(?<text> [\x01-\x09\x0b\x0c\x0e-\x7f])
(?<quoted_pair> \\ (?&text))
(?<qtext> (?&NO_WS_CTL) | [\x21\x23-\x5b\x5d-\x7e])
(?<qcontent> (?&qtext) | (?&quoted_pair))
(?<quoted_string> (?&CFWS)? (?&DQUOTE) (?:(?&FWS)? (?&qcontent))*
(?&FWS)? (?&DQUOTE) (?&CFWS)?)
(?<word> (?&atom) | (?&quoted_string))
(?<phrase> (?&word)+)
# Folding white space
(?<FWS> (?: (?&WSP)* (?&CRLF))? (?&WSP)+)
(?<ctext> (?&NO_WS_CTL) | [\x21-\x27\x2a-\x5b\x5d-\x7e])
(?<ccontent> (?&ctext) | (?&quoted_pair) | (?&comment))
(?<comment> \( (?: (?&FWS)? (?&ccontent))* (?&FWS)? \) )
(?<CFWS> (?: (?&FWS)? (?&comment))*
(?: (?:(?&FWS)? (?&comment)) | (?&FWS)))
# No whitespace control
(?<NO_WS_CTL> [\x01-\x08\x0b\x0c\x0e-\x1f\x7f])
(?<ALPHA> [A-Za-z])
(?<DIGIT> [0-9])
(?<CRLF> \x0d \x0a)
(?<DQUOTE> ")
(?<WSP> [\x20\x09])
)
(?&address)/x';
public function setConfig (array $config = array ())
{
$config = array_merge ($config, array ('needle' => self::PATTERN));
return (parent::setConfig ($config));
}
public function isValid ()
{
return ((is_null ($this -> getData ()))
|| (parent::isValid ()));
}
}
I also built a PHPUnit test that runs this class against various permutations of valid and invalid email addresses culled from various sources (mostly Wikipedia).
The class handles the more mundane cases correctly, but it passes some addresses that are supposed to be invalid and rejects some that are supposed to be valid. I've listed them below:
much."more\ unusual"@example.com (Fails, supposed to be valid)
"(),:;<>[\]@example.com (Passes, supposed to be invalid)
just"not"[email protected] (Passes, supposed to be invalid)
A@b@[email protected] (Passes, supposed to be invalid)
this\ is\"really\"not\\[email protected] (Passes, supposed to be invalid)
PHP seems to parse the regex correctly: it doesn't emit any errors, warnings or notices. Also, all my other test cases (7 other valid addresses and 2 other invalid) pass or fail as they should, so I doubt the cause is that my version of PHP (5.3.8) doesn't support the regex syntax used here. But as I've got both false positives and false negatives, something is obviously wrong. Either my test data is incorrect (as I said, I mostly culled it from Wikipedia), or the regex as written is incorrect in some way.
Is the regex as entered above correct? If not, what corrections need to be made? If it is correct, then is there something wrong with my test cases?
EDIT: I also forgot to mention that, as this is a validation class, it needs to pass only strings that contain an email address and nothing else. I don't want to pass strings that contain a valid email address within non-email address data. I know you do that by using ^pattern_goes_here$, but this regular expression is rather more advanced than most I've worked with in the past, and I'm not sure where the ^ and $ should go. If you could also help with that I'd appreciate it.
Upvotes: 0
Views: 276
Reputation: 145512
If you want to add ^ and $ anchors, this would be the place:
^(?&address)$ /x';
You also need to verify the sources of your email test cases. I would trust those regex subroutines more, as someone wrote them by translating the BNF declarations from the RFC.
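To illustrate what the anchors change, here is a minimal sketch. It uses a deliberately simplified stand-in grammar (dot-atom local part and domain only), not the full RFC 5322 pattern from the question. The variable names and the sample inputs are made up for the example.

```php
<?php
// Simplified stand-in for the full grammar (an assumption for this sketch):
// only dot-atom local parts and dot-atom domains are defined.
$definitions = '
    (?(DEFINE)
        (?<addr_spec>     (?&dot_atom_text) @ (?&dot_atom_text))
        (?<dot_atom_text> (?&atext)+ (?: \. (?&atext)+)*)
        (?<atext>         [A-Za-z0-9!#$%&\'*+\/=?^_`{|}~-])
    )';

// Same definitions, with and without anchors around the entry point.
$unanchored = '/' . $definitions . ' (?&addr_spec) /x';
$anchored   = '/' . $definitions . ' ^ (?&addr_spec) $ /x';

// Illustrative input with two @ signs.
$input = 'a@b@example.com';

var_dump(preg_match($unanchored, $input));           // int(1): the substring "a@b" matches
var_dump(preg_match($anchored, $input));             // int(0): the whole string must match
var_dump(preg_match($anchored, 'user@example.com')); // int(1)
```

Without anchors, preg_match() succeeds as soon as any substring looks like an address, which may account for some of the false positives. Note that in PCRE a plain $ also matches just before a trailing newline; adding the D modifier or using \z instead is stricter.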
Upvotes: 1
Reputation: 6431
Fully validating email addresses is a very tricky business.
Here's a list, complete with tests, that shows different ways to tackle it, but none of them passes all cases.
http://fightingforalostcause.net/misc/2006/compare-email-regex.php
The expression with the best score is currently the one used by PHP's filter_var(), which is based on a regex by Michael Rushton.
I strongly suggest you use filter_var().
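A minimal sketch of that suggestion follows. The helper name isValidEmail is hypothetical; filter_var() and FILTER_VALIDATE_EMAIL are the built-in PHP function and filter constant.

```php
<?php
// Hypothetical helper wrapping PHP's built-in email filter.
function isValidEmail($value)
{
    // filter_var() returns the filtered string on success and false on failure,
    // so compare strictly against false.
    return filter_var($value, FILTER_VALIDATE_EMAIL) !== false;
}

var_dump(isValidEmail('user@example.com')); // bool(true)
var_dump(isValidEmail('a@b@example.com'));  // bool(false)
```

Be aware that this filter is intentionally narrower than the full RFC grammar (as far as I know it does not accept comments or folding whitespace), so it will not agree with every test case derived from the RFC.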
Upvotes: 2