Sates

Reputation: 408

Extracting mail's content

I need to create an app that extracts the VAT numbers our clients send us for verification. Their e-mails contain nothing else. The purpose is to build extended statistics.

What I need is the mail's body without any headers before the content I care about, that is, the VAT number. As simple as that.

This is my script that lists the 30 most recent e-mails:

<?
if (!function_exists('imap_open')) { die('No function'); }

if ($mbox = imap_open(<confidential>)) {
    $output = "";
    $messageCount = imap_num_msg($mbox);
    $x = 1;     
    for ($i = 0; $i < 30; $i++) {
        $message_id = ($messageCount - $i);
        $fetch_message = imap_header($mbox, $message_id);
        $mail_content = quoted_printable_decode(imap_fetchbody($mbox,$message_id, 1));
        // iconv() returns the converted string, so the result has to be assigned
        $mail_content = iconv(mb_detect_encoding($mail_content, mb_detect_order(), true), "UTF-8", $mail_content);

        $output .= "<tr>
        <td>".$x.".</td>
        <td>
            ".$fetch_message->from[0]->mailbox."@".$fetch_message->from[0]->host."
        </td>
        <td>
            ".$fetch_message->date."
        </td>
        <td>
            ".$fetch_message->subject."
        </td>
        <td>
            <textarea cols=\"40\">".$mail_content."</textarea>
        </td>
        </tr>";
        $x++;
    }
    $smarty->assign("enquiries", $output);
    $smarty->display("module_mail");
    imap_close($mbox);
} else {
    print_r(imap_errors());
}
?>

I've worked with imap_fetchbody, imap_header and so on to retrieve the desired content, but it turns out that most e-mails have something else (like headers) before the content, e.g.:

--=-Dbl2eWTUl0Km+Tj46Ww1
Content-Type: text/plain;

------=_NextPart_001_003A_01D14F7A.F25AB3D0
Content-Type: text/plain;

--=-ucRIRGamiKb0Ot1/AkNc
Content-Type: text/plain;

I need to get rid of everything that comes before the VAT number in the mail's message, but I don't know how. Some e-mails have these headers, some don't. And since we work with clients from all over Europe, it really confuses me and leaves me powerless.

Another problem is that some clients just copy-paste VAT numbers from various websites, which means these VAT numbers often arrive with their original styling (bold, background, changed colour, et cetera). That might be the reason for my PS below.

I would appreciate any help that leads me to solving this problem.

Thank you in advance.

PS. Just for the record: with imap_fetchbody($mbox, $message_id, 1) I need to use 1 to get the whole content. Changing 1 to anything else results in NO e-mail content being displayed at all. Literally.

Upvotes: 7

Views: 645

Answers (2)

Adam

Reputation: 18855

You have to use imap_fetchstructure() to find the plain-text part of the mail.

The following code gives you the section number of the text/plain subpart (for instance "1.1"):

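function getTextPart($struct) {
    // type 0 = text: a single-part message, whose body is section "1"
    if ($struct->type == 0) return "1";
    // type 1 = multipart: walk the subparts, which are numbered from 1
    if ($struct->type == 1) {
        $num = 1;
        foreach ($struct->parts as $part) {
            // compare with ==; a single = would assign and always be true
            if (($part->type == 0) && ($part->subtype == "PLAIN")) {
                return "$num";
            } else if ($part->type == 1) {
                // nested multipart: recurse and prefix the parent's number
                $found = getTextPart($part);
                if ($found) return "$num.$found";
            }
            $num++;
        }
    }
    return NULL;
}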

Example of use:

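if ($imap) {
    $messageCount = imap_num_msg($imap);
    // walk the 30 most recent messages (message numbers start at 1)
    for ($i = $messageCount; $i > max(0, $messageCount - 30); $i--) {
        $struct = imap_fetchstructure($imap, $i);
        $part = getTextPart($struct);
        if ($part === NULL) continue; // no text/plain part in this message
        $body = imap_fetchbody($imap, $i, $part);
        print_r($body);
    }
}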
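
One thing to keep in mind: what imap_fetchbody() returns is still in its transfer encoding, so it may need decoding before you look for the VAT number. A minimal sketch, assuming the numeric encoding values that imap_fetchstructure() reports for a part (3 for base64, 4 for quoted-printable). decodePartBody is a hypothetical helper name, and the sample VAT number is made up:

```php
<?php
// Hypothetical helper: decode a raw body part according to the numeric
// "encoding" field of the structure object that describes that part.
function decodePartBody(string $raw, int $encoding): string
{
    switch ($encoding) {
        case 3: // base64 (ENCBASE64 in the imap extension)
            return (string) base64_decode($raw);
        case 4: // quoted-printable (ENCQUOTEDPRINTABLE)
            return quoted_printable_decode($raw);
        default: // 7bit, 8bit, binary, other: return unchanged
            return $raw;
    }
}

// Usage with a made-up, base64-encoded body:
echo decodePartBody(base64_encode("PL1234567890"), 3), "\n";
```

The encoding value has to come from the same part object whose section number you pass to imap_fetchbody(), otherwise the wrong decoder gets applied.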

Upvotes: 0

borracciaBlu

Reputation: 4225

The parts of the email that you see as "noise" are simply part of the email's format.
In a way, it is like reading the HTML source of a web page.

All those bits are boundaries. They work like tags in HTML: each one opens a section, and a matching marker closes it.

So in your case:

Content-Type: multipart/alternative; boundary="=-Dbl2eWTUl0Km+Tj46Ww1" // defines the type of email structure and the boundary

--=-Dbl2eWTUl0Km+Tj46Ww1    // used to start the section
Content-Type: text/plain;   // defines the type of content of the section
// your VAT number is presumably here

--=-Dbl2eWTUl0Km+Tj46Ww1--  // used to close the section
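
To make the open/close idea concrete, here is a rough sketch that cuts a multipart body into its sections using the boundary string. splitByBoundary is a hypothetical helper, the sample body and VAT number are made up, and it assumes you already know the boundary and that every section has at least one header line. It ignores nested multiparts and transfer encodings:

```php
<?php
// Hypothetical helper, not part of any library: split a multipart body
// into its sections, given the boundary string from the Content-Type header.
function splitByBoundary(string $body, string $boundary): array
{
    $sections = [];
    $chunks = explode('--' . $boundary, $body);
    array_shift($chunks); // discard whatever precedes the first boundary
    foreach ($chunks as $chunk) {
        if (strncmp($chunk, '--', 2) === 0) {
            break; // "--boundary--" closes the multipart body
        }
        // Section headers and content are separated by the first blank line.
        $pieces = preg_split('/\R\R/', trim($chunk), 2);
        $sections[] = [
            'headers' => $pieces[0],
            'content' => $pieces[1] ?? '',
        ];
    }
    return $sections;
}

// Made-up sample body; the VAT number is invented for illustration.
$body = "--=-Dbl2eWTUl0Km+Tj46Ww1\r\n"
      . "Content-Type: text/plain;\r\n"
      . "\r\n"
      . "PL1234567890\r\n"
      . "--=-Dbl2eWTUl0Km+Tj46Ww1--\r\n";

$sections = splitByBoundary($body, '=-Dbl2eWTUl0Km+Tj46Ww1');
echo $sections[0]['content'], "\n";
```

This is only meant to show how the markers delimit a section. For real mail, letting the IMAP server or a MIME parser do the splitting is less fragile.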

Possible solutions

You have at least two options:
write a custom parser yourself, or use the PECL extension Mailparse.

Writing a parser manually:

$mail_lines = explode("\n", $mail_content); // the delimiter is the first argument

foreach ($mail_lines as $key => $line) {
     // skip most of the headers (a rough heuristic; adjust to your mails)
     if ($key < 5) {
         continue;
     }

     // skip boundary lines; strpos() returns 0 for a match at the start
     // of the line, so it has to be compared with === 0
     if (strpos($line, "--") === 0) {
        continue;
     }

     // skip Content-* header lines
     if (strpos($line, "Content") === 0) {
        continue;
     }

     // skip blank lines
     if (trim($line) === '') {
        continue;
     }

     ////////////////////////////////////////////////////
     // here you have to insert the logic for the parser
     // and extend the guard clauses
     ////////////////////////////////////////////////////
}
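
For the part the placeholder leaves open, here is one possible sketch of pulling the VAT number out of whatever text remains, including text pasted with HTML styling. extractVatNumbers is a hypothetical helper, and the pattern is an assumption: a two-letter country prefix followed by 2 to 12 letters or digits, at least one of them a digit. It does not validate per-country formats and will miss numbers written with spaces or dots:

```php
<?php
// Hypothetical helper: find strings that look like EU VAT numbers.
function extractVatNumbers(string $text): array
{
    // Pasted styling may arrive as HTML: drop the tags, decode the entities.
    $plain = html_entity_decode(strip_tags($text), ENT_QUOTES, 'UTF-8');

    // Assumed, simplified pattern: 2 capital letters, then 2-12 capitals or
    // digits, with a lookahead requiring at least one digit among them.
    preg_match_all('/\b[A-Z]{2}(?=[0-9A-Z]*[0-9])[0-9A-Z]{2,12}\b/', $plain, $matches);

    // De-duplicate and reindex the matches.
    return array_values(array_unique($matches[0]));
}

// Usage with made-up input:
print_r(extractVatNumbers("<b>PL1234567890</b>"));
```

Since the mails are said to contain nothing but the VAT number, a loose pattern like this may be enough. Tighten it per country if false matches show up.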

Mailparse:

Install Mailparse with sudo pecl install mailparse.

Extract the VAT:

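// $mail_content has to be the complete raw message (headers and body),
// otherwise the parser cannot know which boundary separates the parts
$mail = mailparse_msg_create();
mailparse_msg_parse($mail, $mail_content);
$struct = mailparse_msg_get_structure($mail);

foreach ($struct as $st) {
    $section = mailparse_msg_get_part($mail, $st);
    $info = mailparse_msg_get_part_data($section);

    // $info describes the part (content type, charset, body offsets, ...)
    print_r($info);
}
mailparse_msg_free($mail);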

Upvotes: 3
