Reputation: 488
I am trying to build a spam classifier. I have been reading some of the research papers, and along with content-based features I am also trying to add header-field features, e.g. the number of BCC recipients, the subject, the sender, etc. However, I am stuck at one particular place:
I thank you for your time and consideration.
Here is a sample of my code showing what I am trying to do:
library(stringr)  # str_extract(), str_detect(), str_count(), regex()

extract_header <- function(email.data) {
  feature.names <- c("rec_field_num_of_hops", "span_time", "domain_legality", "date_time_legality", "IP_legality", "sender_legality", "num_of_To_receivers", "num_of_CC_receivers", "num_of_BCC_receivers", "mail_agent", "email_subject", "date_received")
  # one row per email, one column per feature
  header.features <- data.frame(matrix(NA, nrow = length(email.data), ncol = length(feature.names)))
  colnames(header.features) <- feature.names
  email.regex <- "[[:alnum:].-]+@[[:alnum:].-]+" # regular expression to extract an email address
  for (i in seq_along(email.data)) {
    header <- email.data[[i]]$meta$header
    # extracting the email address of the sender (NA if none is found)
    header.features$sender_legality[i] <- str_extract(email.data[[i]]$meta$author, email.regex)[1]
    # the subject of the email
    header.features$email_subject[i] <- email.data[[i]]$meta$heading
    # number of To recipients of the email (the sum is 0 when there is no To: line)
    posToField <- which(str_detect(header, regex("^To:", ignore_case = TRUE)))
    header.features$num_of_To_receivers[i] <- sum(str_count(header[posToField], email.regex))
    # number of people CC'd in the email
    posCCField <- which(str_detect(header, regex("^Cc:", ignore_case = TRUE)))
    header.features$num_of_CC_receivers[i] <- sum(str_count(header[posCCField], email.regex))
    # number of Bcc recipients in the email
    posBccField <- which(str_detect(header, regex("^Bcc:", ignore_case = TRUE)))
    header.features$num_of_BCC_receivers[i] <- sum(str_count(header[posBccField], email.regex))
    # number of email servers the message hopped through
    header.features$rec_field_num_of_hops[i] <- sum(str_count(header, "^Received: from"))
  }
  # return the feature table; the remaining columns stay NA until they are implemented
  header.features
}
I am following the approach laid out in the research papers:
I need to check whether the sender of the email is a legitimate sender. The rationale is that spammers spoof their email address most of the time, so this feature helps identify whether an email is spam.
Header:
From [email protected] Tue Oct 8 10:56:20 2002
Return-Path: <[email protected]>
Delivered-To: [email protected]
Received: from localhost (jalapeno [127.0.0.1])
by example.com (Postfix) with ESMTP id 79DB116F16
for <zzzz@localhost>; Tue, 8 Oct 2002 10:56:20 +0100 (IST)
Received: from jalapeno [127.0.0.1]
by localhost with IMAP (fetchmail-5.9.0)
for zzzz@localhost (single-drop); Tue, 08 Oct 2002 10:56:20 +0100 (IST)
Received: from egwn.net (ns2.egwn.net [193.172.5.4]) by
dogma.slashnull.org (8.11.6/8.11.6) with ESMTP id g988mPK07565 for
<[email protected]>; Tue, 8 Oct 2002 09:48:25 +0100
Received: from auth02.nl.egwn.net (localhost [127.0.0.1]) by egwn.net
(8.11.6/8.11.6/EGWN) with ESMTP id g988i1f16827; Tue, 8 Oct 2002 10:44:02
+0200
Received: from chip.ath.cx (cs146114.pp.htv.fi [213.243.146.114]) by
egwn.net (8.11.6/8.11.6/EGWN) with ESMTP id g988hGf13093 for
<[email protected]>; Tue, 8 Oct 2002 10:43:16 +0200
Received: from chip.ath.cx (localhost [127.0.0.1]) by chip.ath.cx
(8.12.5/8.12.2) with ESMTP id g988hASA018848 for <[email protected]>;
Tue, 8 Oct 2002 11:43:10 +0300
Received: from localhost (pmatilai@localhost) by chip.ath.cx
(8.12.5/8.12.5/Submit) with ESMTP id g988h9j2018844 for
<[email protected]>; Tue, 8 Oct 2002 11:43:10 +0300
X-Authentication-Warning: chip.ath.cx: pmatilai owned process doing -bs
From: Panu Matilainen <[email protected]>
X-X-Sender: [email protected]
To: [email protected]
Subject: Re: a problem with apt-get
In-Reply-To: <[email protected]>
Message-Id: <[email protected]>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Mailscanner: Found to be clean, Found to be clean
Sender: [email protected]
Errors-To: [email protected]
X-Beenthere: [email protected]
X-Mailman-Version: 2.0.11
Precedence: bulk
Reply-To: [email protected]
List-Help: <mailto:[email protected]?subject=help>
List-Post: <mailto:[email protected]>
List-Subscribe: <http://lists.freshrpms.net/mailman/listinfo/rpm-zzzlist>,
<mailto:[email protected]?subject=subscribe>
List-Id: Freshrpms RPM discussion list <rpm-zzzlist.freshrpms.net>
List-Unsubscribe: <http://lists.freshrpms.net/mailman/listinfo/rpm-zzzlist>,
<mailto:[email protected]?subject=unsubscribe>
List-Archive: <http://lists.freshrpms.net/pipermail/rpm-zzzlist/>
X-Original-Date: Tue, 8 Oct 2002 11:43:09 +0300 (EEST)
Date: Tue, 8 Oct 2002 11:43:09 +0300 (EEST)
I hope these additional details help. Thank you for your help :).
Upvotes: 0
Views: 1493
Reputation: 3032
The question is quite general, but I'll try giving some advice.
First, you should consider structuring your classifier hierarchically. That is: build separate classifiers to handle specific problems, e.g. the legality of various parameters like date, x-mailer, etc.
In the context of each of these sub-classifiers you'll be able to use domain-knowledge and debug your code much more easily than when addressing all of these problems together.
For instance, let's focus on separating garbled text from legitimate x-mailers.
Looking at a bunch of examples, you can probably get some insights as to what to look for in order to identify garbage. For instance: field length, character distribution (which will probably be more even for garbled text), a list of known valid x-mailers, etc.
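As a rough illustration of those insights, here is a minimal sketch in base R that turns an X-Mailer string into three features: field length, character entropy (a measure of how even the character distribution is), and membership in a list of known mailers. The names `known_mailers`, `char_entropy` and `xmailer_features` are made up for this example, and the mailer list is a placeholder; a real list would have to be curated from your own data.

```r
# Hypothetical list of known mail agents (assumption, not an authoritative list).
known_mailers <- c("Microsoft Outlook", "Mozilla", "Pine", "Mutt")

# Shannon entropy (in bits) of the character distribution of a single string.
# A more even distribution of characters gives a higher value.
char_entropy <- function(s) {
  chars <- strsplit(s, "")[[1]]
  if (length(chars) == 0) return(0)
  p <- table(chars) / length(chars)
  -sum(p * log2(p))
}

# One feature row per X-Mailer value.
xmailer_features <- function(x) {
  data.frame(
    field_length = nchar(x),
    char_entropy = vapply(x, char_entropy, numeric(1), USE.NAMES = FALSE),
    # TRUE if the value starts with any known mailer name (case-insensitive)
    known_mailer = vapply(
      x,
      function(s) any(startsWith(tolower(s), tolower(known_mailers))),
      logical(1),
      USE.NAMES = FALSE
    )
  )
}

features <- xmailer_features(c("Mutt/1.4i", "xQ9#kz!pLw7@Zr"))
print(features)
```

Entropy grows with string length, so it is probably best read together with `field_length` rather than on its own.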
Based on these insights you can build a classifier just for that: extract the relevant features, train, test, etc.
Once you've solved this problem to your satisfaction, you can use the output of this classifier as an input to the more general spam filter. If you do that, it might be a good idea to let this sub-classifier output a numeric measure of confidence, and not just a boolean, so that the general classifier has more information to base its decision on.
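To make the idea concrete, here is a small sketch using logistic regression (`glm`) for both stages. Everything in it is an assumption for illustration: the data is synthetic, the column names (`char_entropy`, `field_length`, `num_of_hops`, `mailer_garbled_prob`) are invented, and logistic regression is just one convenient model that produces a probability.

```r
set.seed(1)
n <- 200

# Synthetic stand-in data (assumption): garbled mailers tend to have
# somewhat higher entropy and length, with plenty of overlap.
is_garbled <- rbinom(n, 1, 0.5)
mailer <- data.frame(
  is_garbled   = is_garbled,
  char_entropy = rnorm(n, mean = 3 + 0.8 * is_garbled, sd = 0.6),
  field_length = rpois(n, lambda = 15 + 5 * is_garbled)
)

# Sub-classifier: looks only at the x-mailer features.
sub_model <- glm(is_garbled ~ char_entropy + field_length,
                 data = mailer, family = binomial)

# Numeric confidence in [0, 1] instead of a hard TRUE/FALSE label.
mailer_garbled_prob <- predict(sub_model, newdata = mailer, type = "response")

# General classifier: the sub-classifier's confidence is one input among others.
emails <- data.frame(
  is_spam             = rbinom(n, 1, plogis(-1 + 2.5 * is_garbled)),  # synthetic labels
  num_of_hops         = rpois(n, 4),
  mailer_garbled_prob = mailer_garbled_prob
)
spam_model <- glm(is_spam ~ num_of_hops + mailer_garbled_prob,
                  data = emails, family = binomial)
spam_prob <- predict(spam_model, type = "response")
```

In this sketch the sub-classifier is scored on the same rows it was trained on, which is fine for a demonstration but leaks information in practice; with real data you would want to generate `mailer_garbled_prob` from held-out or cross-validated predictions.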
Another option at this point would be to add the features you've found to be working to the set of features of the more general classifier, and let it use them - along with other features - for classification.
This approach can potentially better account for more complex interactions between your features.
Upvotes: 1