azz0r

Reputation: 3311

Mass of text, cherry pick email addresses

I have a large file full of text that contains some email addresses.

Which PHP regular expression function would return an array of all the email addresses it finds?

So far I have

<?php

$pattern = "/^[^@]*@[^@]*\.[^@]*$/";

if ($handle = opendir('files')) {

/* This is the correct way to loop over the directory. */
while (false !== ($file = readdir($handle))) {
   preg_match($pattern, $file, $matches);

   echo count($matches);
   foreach ($matches as $email) {
     echo "$email <br />";
   }
}

closedir($handle);
}

But it returns no results.

Upvotes: 1

Views: 2367

Answers (7)

azz0r

Reputation: 3311

Worth noting: after scouring Google for regexes to use with my script, here are the patterns I collected:

$pattern = "^[_a-z0-9-]+(\.[_a-z0-9-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*(\.[a-z]{2,4})$";
$pattern = "/([\s]*)([_a-zA-Z0-9-]+(\.[_a-zA-Z0-9-]+)*([ ]+|)@([ ]+|)([a-zA-Z0-9-]+\.)+([a-zA-Z]{2,}))([\s]*)/i";
$pattern = '#([^@]+@[-a-z0-9.]+)#';
$pattern = '(^|\s|<)[a-zA-Z]([.+-]?\w+)+@(\w{2,}\.)+\w{2,5}($|\s|>)';
$pattern = "^[a-zA-Z0-9_.-]+@[a-zA-Z0-9-]+.[a-zA-Z0-9-.]+$";
$pattern = "[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?";
$pattern = "(^|\s|<)[a-zA-Z]([.+-]?\w+)+@(\w{2,}\.)+\w{2,5}($|\s|>)";

Of these, the pattern that worked best for me was:

$pattern = "/([\s]*)([_a-zA-Z0-9-]+(\.[_a-zA-Z0-9-]+)*([ ]+|)@([ ]+|)([a-zA-Z0-9-]+\.)+([a-zA-Z]{2,}))([\s]*)/i";
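Several of the patterns collected above are written without delimiters, which the preg_* functions require. The following is a minimal sketch of the difference, using the first pattern from the list; the sample address and the variable names are illustrative assumptions, not part of the original script.

```php
<?php
// A bare pattern taken from the list above (no delimiters).
$bare = '^[_a-z0-9-]+(\.[_a-z0-9-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*(\.[a-z]{2,4})$';

// Without delimiters preg_match() cannot compile the pattern and returns false.
// The @ operator only hides the warning for this demonstration.
$withoutDelimiters = @preg_match($bare, 'someone@example.com');

// Wrapped in "/" delimiters, with the "i" modifier for case-insensitive matching.
$withDelimiters = preg_match('/' . $bare . '/i', 'Someone@Example.com');

var_dump($withoutDelimiters, $withDelimiters);
```

Note that this pattern is anchored with ^ and $, so even with delimiters it is suited to checking a single address rather than searching a block of text.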

Upvotes: 3

Amarghosh

Reputation: 59451

Try this one:

(^|\s|<)[a-zA-Z]([.+-]?\w+)+@(\w{2,}\.)+\w{2,5}($|\s|>)

Add any other delimiters you expect to the leading group (^|\s|<) and the trailing group ($|\s|>).
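As a rough sketch of how this pattern might be used with preg_match_all: the version below adds "/" delimiters and, as my own adaptation, makes the helper groups non-capturing so that group 1 holds only the address and not the surrounding delimiter. The sample text is made up.

```php
<?php
// Adapted form of the pattern: delimiters added, helper groups made non-capturing,
// and one capturing group placed around the address itself.
$pattern = '/(?:^|\s|<)([a-zA-Z](?:[.+-]?\w+)+@(?:\w{2,}\.)+\w{2,5})(?:$|\s|>)/';

// Hypothetical sample text.
$text = 'From: Eve <eve.smith@example.com> cc frank@example.org';

preg_match_all($pattern, $text, $matches);

// $matches[1] holds the captured addresses without the delimiters.
print_r($matches[1]);
```

Because the trailing delimiter is consumed by each match, two addresses separated by a single space could cause the second one to be skipped; a lookahead would avoid that if it matters for your data.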

Upvotes: 0

azz0r

Reputation: 3311

Final code, which works perfectly. Thanks, everyone :)

<?php

set_time_limit(0); // no time limit; the files may be large

// Simplified pattern: anything that looks like local@domain, not a strict RFC check.
$pattern = '#([^@]+@[-a-z0-9.]+)#';

if ($handle = opendir('files')) {
    while (false !== ($file = readdir($handle))) {
        $path = 'files/'.$file;
        if (!is_file($path)) {
            continue; // skip ".", ".." and subdirectories
        }
        $content = file_get_contents($path);
        preg_match_all($pattern, $content, $matches);
        echo count($matches[1]).' - '.$file.'<br />';
    }
    closedir($handle);
}

Upvotes: 0

Dancrumb

Reputation: 27539

There are a number of sites talking about regexes for email addresses. This one in particular is quite expansive.

The short answer is that the definition of a 'valid' email address does not lend itself to a simple regex. Most practical regular expressions for email addresses trade completeness for simplicity.
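One way to live with that trade-off, sketched here as an assumption rather than anything the linked sites prescribe, is to use a deliberately loose regex to collect candidates and then let PHP's built-in filter_var() with FILTER_VALIDATE_EMAIL discard the ones that are not well formed. The sample text and variable names are hypothetical.

```php
<?php
// Hypothetical sample text with one good and one malformed address.
$text = 'Write to carol@example.com, not to erin@example..com.';

// Loose candidate pattern: non-space text on both sides of a single "@".
preg_match_all('/[^@\s]+@[^@\s]+/', $text, $matches);

// Strip trailing sentence punctuation picked up by the loose pattern.
$candidates = array_map(function ($s) {
    return rtrim($s, '.,;:');
}, $matches[0]);

// Keep only candidates that PHP's own validator accepts.
$valid = array_values(array_filter($candidates, function ($s) {
    return filter_var($s, FILTER_VALIDATE_EMAIL) !== false;
}));

print_r($valid);
```

filter_var() is itself stricter than the RFCs in some respects, so this is still a practical compromise and not a complete definition of validity.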

Upvotes: 0

codaddict

Reputation: 455020

Try something like:

$file = file_get_contents('filename.txt');
if(preg_match_all('#([^@]+@[-a-z0-9.]+)#',$file,$matches)) {
  $emails = $matches[1]; // array of all the emails in the file.
}

The regex is simplified and is not a full RFC 822 implementation.
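One consequence of the simplification is worth seeing: [^@]+ matches any character except @, including spaces and newlines, so a match can swallow the text that precedes the address. The sketch below compares it with a variant that also excludes whitespace; the variant and the sample text are my own illustration, not part of the answer.

```php
<?php
// Hypothetical sample text.
$text = "Hello,\nplease mail dan@example.com today";

// Pattern as given in the answer: the local part may include whitespace.
preg_match_all('#([^@]+@[-a-z0-9.]+)#', $text, $loose);

// Variant that stops the local part at whitespace.
preg_match_all('#([^@\s]+@[-a-z0-9.]+)#', $text, $tight);

var_dump($loose[1][0]); // everything from "Hello," up to the end of the domain
var_dump($tight[1][0]); // just the address
```

The match count is the same in both cases, so the difference only shows up when you print or store the matched strings.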

EDIT:

The readdir function returns the filename on success, not the file contents. You can try:

while (false !== ($file = readdir($handle))) {
   // readdir() gives only the name, so prepend the directory that was opened
   $file_contents = file_get_contents('files/'.$file);
   if(preg_match_all('#([^@]+@[-a-z0-9.]+)#', $file_contents, $matches)) {

     echo count($matches[1]);
     foreach ($matches[1] as $email) {
       echo "$email <br />";
     }
   }
}

Upvotes: 1

benzado

Reputation: 84328

I see three problems:

  1. In regular expressions, ^ means the start of a line (or string) and $ means the end of a line (or string). That is probably why the pattern you are using doesn't work: it would only find an email address on a line by itself.

  2. You are passing the file's name to preg_match; it is expecting a string to be searched. You need to call file_get_contents or something like it to pass the file's text to the function.

  3. You need to use preg_match_all to find more than one match at a time, if there are multiple addresses in each file.
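The three points above can be seen in a small sketch. The sample string stands in for a file's contents, and the unanchored pattern is a simplified assumption of mine, not a recommended email regex.

```php
<?php
// Hypothetical sample text standing in for one file's contents.
$text = "Contact alice@example.com or bob@example.org for details\n";

// Pattern from the question: anchored to the start and end of the string.
$anchored = '/^[^@]*@[^@]*\.[^@]*$/';

// Same idea without anchors, and limited to non-whitespace characters.
$unanchored = '/[^@\s]+@[^@\s]+\.[^@\s]+/';

// The anchored pattern finds nothing in running text with two addresses.
$anchoredCount = preg_match_all($anchored, $text, $anchoredMatches);

// preg_match() stops after the first match.
preg_match($unanchored, $text, $first);

// preg_match_all() collects every match into $all[0].
preg_match_all($unanchored, $text, $all);

var_dump($anchoredCount, $first[0], $all[0]);
```

In the real script, $text would come from file_get_contents() on the file's path, not from the filename itself.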

Upvotes: 1

Gordon

Reputation: 316969

Read through

You can adapt the regex given there, or any other regex you find on the web for this purpose, and then simply call

preg_match_all($pattern, $someString, $matches);

$matches will then contain whatever was found for the Regex you used.

In case your file is too big to be loaded into memory, consider iterating over it with fgets().
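A minimal sketch of the fgets() approach follows. The function name collectEmailsByLine and its default pattern are hypothetical, chosen only for illustration; it relies on the assumption that an address never spans a line break.

```php
<?php
/**
 * Collect email-like strings from a file, reading one line at a time
 * so the whole file never has to fit in memory.
 * The default pattern is a simplified assumption, not an RFC-complete regex.
 */
function collectEmailsByLine(string $path, string $pattern = '/[^@\s]+@[-a-z0-9.]+/i'): array
{
    $emails = [];
    $handle = fopen($path, 'r');
    if ($handle === false) {
        return $emails; // the file could not be opened
    }
    while (($line = fgets($handle)) !== false) {
        // Append every match found on this line.
        if (preg_match_all($pattern, $line, $matches)) {
            foreach ($matches[0] as $email) {
                $emails[] = $email;
            }
        }
    }
    fclose($handle);
    return $emails;
}
```

It could be called as collectEmailsByLine('files/'.$file) inside the readdir() loop; for very large result sets you might write matches out as you go instead of building one array.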

Upvotes: 0
