Email spam classification extracting features from header

Question

I am trying to build a spam classifier. I have been reading some of the research papers, and along with content-based features I am also trying to add header-field features, e.g. number of BCC recipients, subject, sender, etc. However, I am stuck on a few points:

I need to check the legitimacy of the sender's domain. I am writing all my code in R and I am not sure how to check that in R.
I am also trying to extract the X-Mailer field, which is not difficult in itself. A missing X-Mailer is a good indication that the email is spam, but the problem arises when spammers obfuscate the field by filling it with garbled text. How can I differentiate between garbled X-Mailer content and a legitimate X-Mailer?
Similarly, I am trying to create features like "domain_legality" (legality of the sender's domain), "date_time_legality" (legality of the date and time the message was created and received), "IP_legality" (IP of the receiver), and "sender_legality", which are somewhat self-explanatory.

I thank you for your time and consideration.

Here is a sample of my code showing what I am trying to do:

library(stringr)

extract_header <- function(email.data){
  email.regex <- "[[:alnum:].-]+@[[:alnum:].-]+" #regular expression to extract email addresses
  feature.names <- c("rec_field_num_of_hops", "span_time", "domain_legality", "date_time_legality", "IP_legality", "sender_legality", "num_of_To_receivers", "num_of_CC_receivers", "num_of_BCC_receivers", "mail_agent", "email_subject", "date_received")

  #one row per email and one column per feature, initialised to NA
  header.features <- data.frame(matrix(NA, nrow = length(email.data), ncol = length(feature.names)))
  colnames(header.features) <- feature.names

  #count the addresses on the header lines that start with the given field name
  #(the sum over zero matching lines is 0, so no if/else is needed)
  count_addresses <- function(header, field){
    posField <- which(str_detect(header, regex(paste0("^", field, ":"), ignore_case = TRUE)))
    sum(str_count(header[posField], email.regex))
  }

  for(i in seq_along(email.data)){
    header <- email.data[[i]]$meta$header

    #extracting the email address of the sender
    header.features$sender_legality[i] <- str_extract(email.data[[i]]$meta$author, email.regex)

    #the subject of the email
    header.features$email_subject[i] <- email.data[[i]]$meta$heading

    #number of To recipients of the email
    header.features$num_of_To_receivers[i] <- count_addresses(header, "To")

    #number of people CC'd in the email
    header.features$num_of_CC_receivers[i] <- count_addresses(header, "Cc")

    #number of Bcc people in the email
    header.features$num_of_BCC_receivers[i] <- count_addresses(header, "Bcc")

    #number of email servers hopped by
    header.features$rec_field_num_of_hops[i] <- sum(str_count(header, "^Received: from"))
  }

  #return the feature table (the loop on its own returns nothing)
  header.features
}

I am following the approach laid out in the research papers:

A scalable intelligent non-content-based spam-filtering framework
Identifying Potentially Useful Email Header Features for Email Spam Filtering

I need to check whether the sender of the email is legitimate. The rationale is that spammers often spoof their email address, so this feature helps identify whether the email is spam.

Header:

From rpm-list-admin@freshrpms.net  Tue Oct  8 10:56:20 2002
Return-Path: 
Delivered-To: zzzz@localhost.example.com
Received: from localhost (jalapeno [127.0.0.1])
    by example.com (Postfix) with ESMTP id 79DB116F16
    for ; Tue,  8 Oct 2002 10:56:20 +0100 (IST)
Received: from jalapeno [127.0.0.1]
    by localhost with IMAP (fetchmail-5.9.0)
    for zzzz@localhost (single-drop); Tue, 08 Oct 2002 10:56:20 +0100 (IST)
Received: from egwn.net (ns2.egwn.net [193.172.5.4]) by
    dogma.slashnull.org (8.11.6/8.11.6) with ESMTP id g988mPK07565 for
    ; Tue, 8 Oct 2002 09:48:25 +0100
Received: from auth02.nl.egwn.net (localhost [127.0.0.1]) by egwn.net
    (8.11.6/8.11.6/EGWN) with ESMTP id g988i1f16827; Tue, 8 Oct 2002 10:44:02
    +0200
Received: from chip.ath.cx (cs146114.pp.htv.fi [213.243.146.114]) by
    egwn.net (8.11.6/8.11.6/EGWN) with ESMTP id g988hGf13093 for
    ; Tue, 8 Oct 2002 10:43:16 +0200
Received: from chip.ath.cx (localhost [127.0.0.1]) by chip.ath.cx
    (8.12.5/8.12.2) with ESMTP id g988hASA018848 for ;
    Tue, 8 Oct 2002 11:43:10 +0300
Received: from localhost (pmatilai@localhost) by chip.ath.cx
    (8.12.5/8.12.5/Submit) with ESMTP id g988h9j2018844 for
    ; Tue, 8 Oct 2002 11:43:10 +0300
X-Authentication-Warning: chip.ath.cx: pmatilai owned process doing -bs
From: Panu Matilainen 
X-X-Sender: pmatilai@chip.ath.cx
To: rpm-zzzlist@freshrpms.net
Subject: Re: a problem with apt-get
In-Reply-To: 
Message-Id: 
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Mailscanner: Found to be clean, Found to be clean
Sender: rpm-zzzlist-admin@freshrpms.net
Errors-To: rpm-zzzlist-admin@freshrpms.net
X-Beenthere: rpm-zzzlist@freshrpms.net
X-Mailman-Version: 2.0.11
Precedence: bulk
Reply-To: rpm-zzzlist@freshrpms.net
List-Help: 
List-Post: 
List-Subscribe: ,
    
List-Id: Freshrpms RPM discussion list 
List-Unsubscribe: ,
    
List-Archive: 
X-Original-Date: Tue, 8 Oct 2002 11:43:09 +0300 (EEST)
Date: Tue, 8 Oct 2002 11:43:09 +0300 (EEST)

I hope these additional details help. Thank you for your help :).

etov · Accepted Answer

The question is quite general, but I'll try giving some advice.

First, you should consider structuring your classifier hierarchically. That is: build separate classifiers to handle specific problems, e.g. the legality of various parameters like date, x-mailer, etc.

In the context of each of these sub-classifiers you'll be able to use domain-knowledge and debug your code much more easily than when addressing all of these problems together.
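As one small illustration of such a sub-problem, a date check could compare the Date header with the timestamp of the most recent Received hop. The sketch below is only that, a sketch under assumptions: the function names `parse_header_date` and `date_time_legality`, the five-minute clock tolerance and the one-week maximum delay are all hypothetical and would need tuning on real mail. It uses base R only.

```r
# Parse the first date found in a header value into POSIXct (UTC).
# Month names are matched against month.abb, so parsing does not depend on the locale.
parse_header_date <- function(x) {
  if (is.null(x) || length(x) != 1 || is.na(x)) return(as.POSIXct(NA))
  pattern <- paste0(
    "([0-9]{1,2})[[:space:]]+([A-Za-z]{3})[[:space:]]+([0-9]{4})[[:space:]]+",
    "([0-9]{2}):([0-9]{2}):([0-9]{2})[[:space:]]+([+-][0-9]{4})"
  )
  pieces <- regmatches(x, regexec(pattern, x))[[1]]
  if (length(pieces) == 0) return(as.POSIXct(NA))
  iso <- sprintf("%s-%02d-%02d %s:%s:%s %s",
                 pieces[4], match(pieces[3], month.abb), as.integer(pieces[2]),
                 pieces[5], pieces[6], pieces[7], pieces[8])
  as.POSIXct(iso, format = "%Y-%m-%d %H:%M:%S %z", tz = "UTC")
}

# TRUE when the message was received no earlier than it claims to have been sent
# (allowing for some clock skew) and not implausibly long afterwards.
# Both thresholds are assumed values, not taken from any paper.
date_time_legality <- function(date_value, received_value,
                               tolerance_secs = 300,
                               max_delay_secs = 7 * 24 * 3600) {
  sent <- parse_header_date(date_value)
  received <- parse_header_date(received_value)
  if (is.na(sent) || is.na(received)) return(FALSE)
  delay <- as.numeric(difftime(received, sent, units = "secs"))
  delay >= -tolerance_secs && delay <= max_delay_secs
}

# Values copied from the sample header in the question.
date_time_legality(
  "Tue, 8 Oct 2002 11:43:09 +0300 (EEST)",
  "for zzzz@localhost (single-drop); Tue, 08 Oct 2002 10:56:20 +0100 (IST)"
)
```

For the sample header, the Date value is about 73 minutes earlier than the last Received timestamp once both are converted to UTC, so the check passes. Returning the raw delay instead of a boolean would give a downstream classifier more to work with.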

For instance, let's focus on separating garbled text from legitimate x-mailers.

Looking at a bunch of examples, you can probably get some insights as to what to look for in order to identify garbage. For instance: field length, character distribution (which will probably be more even for garbled text), a list of known valid x-mailers, etc.
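The three signals just mentioned can be computed in a few lines of base R. This is a minimal sketch, not a tested feature set: `char_entropy`, `xmailer_features` and the `known_agents` list are hypothetical names, and the list is a deliberately short, illustrative set of mail-agent prefixes that you would replace with values observed in your own corpus.

```r
# Shannon entropy (in bits) of the characters in a single string.
# A more even character distribution gives a higher value.
char_entropy <- function(x) {
  chars <- strsplit(x, "")[[1]]
  if (length(chars) == 0) return(0)
  p <- as.numeric(table(chars)) / length(chars)
  -sum(p * log2(p))
}

# Illustrative, incomplete list of mail-agent name prefixes (an assumption).
known_agents <- c("Microsoft Outlook", "Apple Mail", "Mozilla", "Evolution")

# One row of features for a single X-Mailer value (NA or "" means the field is missing).
xmailer_features <- function(x, known = known_agents) {
  present <- !is.null(x) && length(x) == 1 && !is.na(x) && nzchar(x)
  if (!present) {
    return(data.frame(present = FALSE, n_chars = 0L, entropy = 0, known_agent = FALSE))
  }
  data.frame(
    present = TRUE,
    n_chars = nchar(x),
    entropy = char_entropy(x),
    # case-insensitive prefix match against the list of known agents
    known_agent = any(startsWith(tolower(x), tolower(known)))
  )
}

xmailer_features("Microsoft Outlook Express 6.00.2600.0000")
xmailer_features(NA)
```

Keeping "missing" as its own column lets a classifier treat an absent field differently from a present but suspicious one.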

Based on these insights you can build a classifier just for that: extract the relevant features, train, test, etc.

Once you've solved this problem to your satisfaction, you can use the output of this classifier as an input to the more general spam filter. If you do that, it might be a good idea to let this sub-classifier output a numeric measure of confidence, and not just a boolean, so that the general classifier has more information to base its decision on.
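A rough sketch of that hand-off, assuming logistic regression as the sub-classifier (my choice for illustration; any model that outputs a probability would do). The training values below are invented purely so the example runs, and the names `sub_model`, `garbled_confidence` and `xmailer_garbled_score` are hypothetical.

```r
# Made-up training values, only so that the example runs:
# character entropy of an X-Mailer string and whether it was labelled as garbled.
train <- data.frame(
  entropy = c(2.1, 2.5, 2.8, 3.0, 3.2, 3.4, 3.6, 3.9, 4.1, 4.4),
  garbled = c(0,   0,   0,   1,   0,   1,   0,   1,   1,   1)
)

# Logistic regression as the sub-classifier.
sub_model <- glm(garbled ~ entropy, data = train, family = binomial)

# type = "response" returns a probability between 0 and 1 rather than a hard label.
garbled_confidence <- function(entropy) {
  as.numeric(predict(sub_model,
                     newdata = data.frame(entropy = entropy),
                     type = "response"))
}

# The score becomes one more column in the general classifier's feature table.
main_features <- data.frame(
  num_of_To_receivers = c(1, 40),
  xmailer_entropy = c(2.3, 4.2)
)
main_features$xmailer_garbled_score <- garbled_confidence(main_features$xmailer_entropy)
main_features
```

To avoid leaking labels, the sub-classifier's scores for the general classifier's training emails should come from a model that was not trained on those same emails, for example via cross-validation.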

Another option at this point would be to add the features you've found to be working to the set of features of the more general classifier, and let it use them - along with other features - for classification.

This approach can potentially better account for more complex interactions between your features.
