luqita

Reputation: 4077

Parse email string with metadata and get the From and Cc values

I'm trying to get the From and Cc email addresses from a forwarded email whose body looks like this:

$body = '-------
Begin forwarded message:


From: Sarah Johnson <[email protected]>

Subject: email subject

Date: February 22, 2013 3:48:12 AM

To: Email Recipient <[email protected]>

Cc: Ralph Johnson <[email protected]>


Hi,


hello, thank you and goodbye!

 [email protected]'

Now, when I do the following:

$body = strtolower($body);
$pattern = '#from: \D*\S([\w-\.]+)@((?:[\w]+\.)+)([a-zA-Z]{2,4})\S#';
if (preg_match($pattern, $body, $arr_matches)) {
     echo htmlentities($arr_matches[0]);
     die();
}

I correctly get:

from: sarah johnson <[email protected]>

Now, why doesn't the same thing work for Cc? I do something very similar, only changing from to cc:

$body = strtolower($body);
$pattern = '#cc: \D*\S([\w-\.]+)@((?:[\w]+\.)+)([a-zA-Z]{2,4})\S#';
if (preg_match($pattern, $body, $arr_matches)) {
     echo htmlentities($arr_matches[0]);
     die();
}

and I get:

cc: ralph johnson <[email protected]> hi, hello, thank you and goodbye! [email protected]

If I remove the email from the original body footer (removing [email protected]) then I correctly get:

cc: ralph johnson <[email protected]>

It looks like the footer email is affecting the regular expression. But how, and why doesn't it affect the From case? How can I fix this?

Upvotes: 0

Views: 143

Answers (3)

mickmackusa

Reputation: 47874

Parsing the email's metadata doesn't need to use a convoluted regex pattern.

Use ^ with the m pattern modifier to start matches from the beginning of any new line.

Match the header name (the text from the start of a line up to the colon) with [^:\v]+. The \v in the negated character class keeps the match from spanning multiple lines.

For easy access, form an associative array from the two captured values of each qualifying line.

Code:

preg_match_all('/^([^:\v]+): *(.+)/m', $body, $m);
var_export(
    array_combine($m[1], $m[2])
);

Output:

array (
  'From' => 'Sarah Johnson <[email protected]>',
  'Subject' => 'email subject',
  'Date' => 'February 22, 2013 3:48:12 AM',
  'To' => 'Email Recipient <[email protected]>',
  'Cc' => 'Ralph Johnson <[email protected]>',
)
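If only the addresses are needed, the header array can be narrowed down in a second step. This is a rough sketch, assuming each value has the form `Name <address>`. The example.com addresses and the `$headers` / `$addresses` names are placeholders I made up, not data from the question.

```php
<?php
// Placeholder body: same layout as the question, with made-up example.com addresses.
$body = "-------\nBegin forwarded message:\n\n\n"
      . "From: Sarah Johnson <sarah@example.com>\n\n"
      . "Subject: email subject\n\n"
      . "Date: February 22, 2013 3:48:12 AM\n\n"
      . "To: Email Recipient <to@example.com>\n\n"
      . "Cc: Ralph Johnson <ralph@example.com>\n\n\n"
      . "Hi,\n\n\nhello, thank you and goodbye!\n\n footer@example.com";

// Step 1: collect "Name: value" lines into an associative array.
preg_match_all('/^([^:\v]+): *(.+)/m', $body, $m);
$headers = array_combine($m[1], $m[2]);

// Step 2: pull the address out of the angle brackets for the headers of interest.
$addresses = [];
foreach (['From', 'Cc'] as $name) {
    if (isset($headers[$name])
        && preg_match('/<([^<>\s]+@[^<>\s]+)>/', $headers[$name], $hit)) {
        $addresses[$name] = $hit[1];
    }
}

var_export($addresses);
```

Two caveats: any body line of the form `Word: text` is also picked up as a header, and a repeated header name keeps only its last value, because array_combine() overwrites duplicate keys.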

Upvotes: 0

Winston

Reputation: 1805

Try it like this:

$body = '-------
Begin forwarded message:


From: Sarah Johnson <[email protected]>

Subject: email subject

Date: February 22, 2013 3:48:12 AM

To: Email Recipient <[email protected]>

Cc: Ralph Johnson <[email protected]>


Hi,


hello, thank you and goodbye!

 [email protected]';

$pattern = '#(?:from|Cc):\s+[^<>]+<([^@]+@[^>\s]+)>#is';
preg_match_all($pattern, $body, $arr_matches);
echo '<pre>' . htmlspecialchars(print_r($arr_matches, 1)) . '</pre>';

Output

Array
(
    [0] => Array
        (
            [0] => From: Sarah Johnson <[email protected]>
            [1] => Cc: Ralph Johnson <[email protected]>
        )

    [1] => Array
        (
            [0] => [email protected]
            [1] => [email protected]
        )

)

$arr_matches[1][0] - "From" email
$arr_matches[1][1] - "Cc" email
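The positional indexes shift if one of the headers is missing, so it may be safer to key the result by header name. A possible variation, not the exact pattern above: it captures the header name, anchors it to the start of a line, and keeps each match on one line. The addresses and the `$byHeader` name are placeholders of my own.

```php
<?php
// Shortened placeholder body with made-up example.com addresses.
$body = "From: Sarah Johnson <sarah@example.com>\n\n"
      . "To: Email Recipient <to@example.com>\n\n"
      . "Cc: Ralph Johnson <ralph@example.com>\n\n"
      . "Hi,\n\n footer@example.com";

// Group 1 is the header name, group 2 the address inside the angle brackets.
// [^<>\v]* and \h* keep the match from running onto the next line.
$pattern = '#^(from|cc):\h*[^<>\v]*<([^@\s<>]+@[^>\s]+)>#im';
preg_match_all($pattern, $body, $sets, PREG_SET_ORDER);

$byHeader = [];
foreach ($sets as $set) {
    $byHeader[strtolower($set[1])] = $set[2];
}

print_r($byHeader);
```

With this, `$byHeader['cc']` is simply unset when the message has no Cc line, instead of another header's address landing at that index.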

Upvotes: 1

stema

Reputation: 92976

The problem is that \D* matches too much: it also matches newline characters, so it runs past the end of the Cc line. I would be more restrictive here. Why use \D (not a digit) at all?

With [^@]*, for example, it works:

cc: [^@]*\S([\w-\.]+)@((?:[\w]+\.)+)([a-zA-Z]{2,4})\S

This way, you can be sure that the first part does not match beyond the email address.

\D is also the reason it works in the first case, "From": there are digits in the "Date" row, so \D* cannot match past that row.
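To see the three behaviours side by side, here is a small sketch. It assumes a lowercased body like the question's, with example.com addresses as placeholders. The `$tail` helper is only there to avoid repeating the shared part of the pattern, and its character class is written as [\w.-] so the hyphen is plainly literal.

```php
<?php
// Placeholder data: lowercased like the question's $body, made-up addresses.
$body = "from: sarah johnson <sarah@example.com>\n\n"
      . "date: february 22, 2013 3:48:12 am\n\n"
      . "cc: ralph johnson <ralph@example.com>\n\n"
      . "hi,\n\nhello, thank you and goodbye!\n\n footer@example.com";

// Shared tail of the question's pattern (hypothetical helper variable).
$tail = '\S([\w.-]+)@((?:\w+\.)+)([a-z]{2,4})\S#';

// \D* is stopped by the digits in the date line, so it backtracks to the From address.
preg_match('#from: \D*' . $tail, $body, $from);

// No digits follow the Cc line, so greedy \D* runs on to the footer address.
preg_match('#cc: \D*' . $tail, $body, $ccGreedy);

// [^@]* cannot pass the first "@", so the match stays on the Cc line.
preg_match('#cc: [^@]*' . $tail, $body, $ccFixed);

echo $from[0], "\n---\n", $ccGreedy[0], "\n---\n", $ccFixed[0], "\n";
```

The second match swallows the greeting and ends at the footer address, while the first and third stop at the closing `>` of their own line.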

Upvotes: 3
