Reputation: 3311
I have a large text file that contains some email addresses.
Which PHP regular expression function would return an array of all the email addresses it finds?
So far I have
<?php
$pattern = "/^[^@]*@[^@]*\.[^@]*$/";
if ($handle = opendir('files')) {
    /* This is the correct way to loop over the directory. */
    while (false !== ($file = readdir($handle))) {
        preg_match($pattern, $file, $matches);
        echo count($matches);
        foreach ($matches as $email) {
            echo "$email <br />";
        }
    }
    closedir($handle);
}
but it returns no results.
Upvotes: 1
Views: 2367
Reputation: 3311
For reference, here are the patterns I collected after searching Google for email regexes and trying them with my script:
$pattern = "^[_a-z0-9-]+(\.[_a-z0-9-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*(\.[a-z]{2,4})$";
$pattern = "/([\s]*)([_a-zA-Z0-9-]+(\.[_a-zA-Z0-9-]+)*([ ]+|)@([ ]+|)([a-zA-Z0-9-]+\.)+([a-zA-Z]{2,}))([\s]*)/i";
$pattern = '#([^@]+@[-a-z0-9.]+)#';
$pattern = '(^|\s|<)[a-zA-Z]([.+-]?\w+)+@(\w{2,}\.)+\w{2,5}($|\s|>)';
$pattern = "^[a-zA-Z0-9_.-]+@[a-zA-Z0-9-]+.[a-zA-Z0-9-.]+$";
$pattern = "[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?";
$pattern = "(^|\s|<)[a-zA-Z]([.+-]?\w+)+@(\w{2,}\.)+\w{2,5}($|\s|>)";
The pattern that worked best for me is:
$pattern = "/([\s]*)([_a-zA-Z0-9-]+(\.[_a-zA-Z0-9-]+)*([ ]+|)@([ ]+|)([a-zA-Z0-9-]+\.)+([a-zA-Z]{2,}))([\s]*)/i";
Upvotes: 3
Reputation: 59451
Try this one:
(^|\s|<)[a-zA-Z]([.+-]?\w+)+@(\w{2,}\.)+\w{2,5}($|\s|>)
Add any other delimiters you expect to the opening group (^|\s|<) and the closing group ($|\s|>).
Upvotes: 0
Reputation: 3311
Final code, which works perfectly. Thanks, everyone :)
<?php
set_time_limit(0); // 0 means no time limit; the function takes an int
// Loose pattern: counts anything of the form something@domain.
$pattern = '#([^@]+@[-a-z0-9.]+)#';
if ($handle = opendir('files')) {
    while (false !== ($file = readdir($handle))) {
        $path = 'files/' . $file;
        if (!is_file($path)) {
            continue; // skip ".", ".." and subdirectories
        }
        $content = file_get_contents($path);
        preg_match_all($pattern, $content, $matches);
        echo count($matches[1]) . ' - ' . $file . '<br />';
    }
    closedir($handle);
}
Upvotes: 0
Reputation: 27539
There are a number of sites talking about regexes for email addresses. This one in particular is quite expansive.
The short answer is that the definition of a 'valid' email address does not lend itself to a simple regex. Most practical regular expressions for email addresses trade completeness for simplicity.
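One way to live with that trade-off is to extract candidates with a loose pattern and then filter them through PHP's built-in filter_var() with FILTER_VALIDATE_EMAIL. The following is only a sketch: the function name findLikelyEmails(), the sample text, and the loose pattern are assumptions for illustration, and filter_var() has its own idea of validity that is also not a complete RFC implementation.

```php
<?php
// Hypothetical helper: loose extraction first, stricter filtering second.
function findLikelyEmails(string $text): array
{
    // Loose candidate pattern (an assumption, deliberately simple).
    preg_match_all('/[^@\s,;<>]+@[-a-z0-9.]+/i', $text, $matches);

    // Keep only the candidates that PHP's own validator accepts.
    $valid = array_filter($matches[0], function ($candidate) {
        return filter_var($candidate, FILTER_VALIDATE_EMAIL) !== false;
    });

    return array_values($valid); // reindex from 0
}

// Made-up sample text: one real-looking address and one false positive.
$sample = "Write to alice@example.com or see the note at line@5.";
$emails = findLikelyEmails($sample);
print_r($emails);
```

The loose pattern decides how much text is considered at all; the validator only removes candidates, it never finds addresses the pattern missed.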
Upvotes: 0
Reputation: 455020
Try something like:
$file = file_get_contents('filename.txt');
// [^@\s]+ (rather than [^@]+) keeps the match from swallowing the text before the address.
if (preg_match_all('#([^@\s]+@[-a-z0-9.]+)#i', $file, $matches)) {
    $emails = $matches[1]; // array of all the emails in the file
}
The regex is simplified and is not a full RFC 822 implementation.
EDIT:
The readdir function returns the filename on success, not the file contents. You can try doing:
while (false !== ($file = readdir($handle))) {
    $path = 'files/' . $file; // readdir() returns only the name, so prepend the directory
    if (!is_file($path)) {
        continue; // skip ".", ".." and subdirectories
    }
    $file_contents = file_get_contents($path);
    if (preg_match_all('#([^@\s]+@[-a-z0-9.]+)#i', $file_contents, $matches)) {
        echo count($matches[1]);
        foreach ($matches[1] as $email) {
            echo "$email <br />";
        }
    }
}
Upvotes: 1
Reputation: 84328
I see three problems:
1. In regular expressions, ^ means the start of the string (or of a line, in multiline mode) and $ means the end of the string (or line). That is probably why the pattern you are using doesn't work: it only matches when the whole subject is a single email address.
2. You are passing the file's name to preg_match; it expects the string to be searched. You need to call file_get_contents or something like it to pass the file's text to the function.
3. You need to use preg_match_all to find more than one match at a time, if there are multiple addresses in each file.
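The three points above can be seen in a small sketch. The sample text and the unanchored pattern are made up for illustration and are not from the question; only the anchored pattern is the one the question uses.

```php
<?php
// Hypothetical sample text with two addresses inside a sentence.
$text = "Contact alice@example.com or bob@example.org for details.";

// Anchored pattern from the question: the whole string must be one address.
$anchored = '/^[^@]*@[^@]*\.[^@]*$/';
// Unanchored, deliberately simple pattern (an assumption, not RFC-complete).
$loose = '/[^@\s]+@[-a-z0-9.]+/i';

// The anchored pattern cannot match here: the string holds two "@" characters.
$anchoredCount = preg_match_all($anchored, $text, $unused);

// preg_match() stops after the first match.
preg_match($loose, $text, $first);

// preg_match_all() collects every match into $all[0].
preg_match_all($loose, $text, $all);

echo $anchoredCount, "\n";
print_r($first);
print_r($all[0]);
```

Note that the text being searched is the content itself; with a file, that would be the result of file_get_contents() rather than the file name.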
Upvotes: 1
Reputation: 316969
Read through
You can adapt the Regex given there or any other Regex you can find on the web for this purpose and then simply do a
preg_match_all($pattern, $someString, $matches);
$matches will then contain whatever was found for the Regex you used.
In case your file is too big to be loaded into memory, consider iterating over it with fgets().
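A minimal sketch of that line-by-line approach, assuming no address is split across a line break. The function name extractEmailsFromFile() and the pattern are hypothetical and only here for illustration.

```php
<?php
// Scan a file one line at a time so it is never fully loaded into memory.
function extractEmailsFromFile(string $path): array
{
    $pattern = '/[^@\s]+@[-a-z0-9.]+/i'; // simplified, not RFC-complete
    $emails = [];

    $handle = fopen($path, 'r');
    if ($handle === false) {
        return $emails; // the file could not be opened
    }

    // fgets() returns false at end of file (or on error).
    while (($line = fgets($handle)) !== false) {
        if (preg_match_all($pattern, $line, $matches)) {
            foreach ($matches[0] as $email) {
                $emails[] = $email;
            }
        }
    }
    fclose($handle);

    return $emails;
}
```

Calling extractEmailsFromFile('filename.txt') would return a flat array of the matches; memory use then depends on the longest line and the number of matches, not on the file size.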
Upvotes: 0