Austin

Reputation: 387

Regular Expression for Mail Header Message

I have a mailbox file containing over 50 MB of messages, each separated by a line like this:

From - Thu Jul 19 07:11:55 2007

I want to build a regular expression for this in Java so I can extract the mail messages one at a time. I tried a Scanner with the following pattern as the delimiter:

public boolean ParseData(DataSource data_source) {

    boolean is_successful_transfer = false;
    String mail_header_regex = "^From\\s";
    LinkedList<String> ip_addresses = new LinkedList<String>();
    ASNRepository asn_repository = new ASNRepository();

    try {

        Pattern mail_header_pattern = Pattern.compile(mail_header_regex);

        File input_file = data_source.GetInputFile();

        //parse out each message from the mailbox
        Scanner scanner = new Scanner(input_file);

        while(scanner.hasNext(mail_header_pattern)) {

            String current_line = scanner.next(mail_header_pattern);

            Matcher mail_matcher = mail_header_pattern.matcher(current_line);

            //read each mail message and extract the proper "received from" ip address
            //to put it in our list of ip's we can add to the database to prepare
            //for querying.
            while(mail_matcher.find()) {
                String message_text = mail_matcher.group();
                String ip_address = get_ip_address(message_text);

                //empty ip address means the line contains no received from
                if(!ip_address.trim().isEmpty())
                    ip_addresses.add(ip_address);
            }

        }//next line

        //add ip addresses from mailbox to database
        is_successful_transfer = asn_repository.AddIPAddresses(ip_addresses);

    }

    //error reading file--unsuccessful transfer
    catch(FileNotFoundException ex) {
        is_successful_transfer = false;
    }

    return is_successful_transfer;

}

This seems like it should work, but whenever I run it the program hangs, probably because it never finds the pattern. The same regular expression works in Perl on the same file, but in Java it always hangs on the line String current_line = scanner.next(mail_header_pattern);

Is this regular expression correct or am I parsing the file incorrectly?

Upvotes: 0

Views: 908

Answers (1)

Bohemian

Reputation: 425438

I'd lean toward something much simpler: just read the file line by line, something like this:

while(scanner.hasNextLine()) {
    String line = scanner.nextLine();
    if (line.matches("^From\\s.*")) {
       // it's a new email
    } else {
       // it's still part of the email body
    }
}
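For background on why the original loop finds nothing: Scanner.hasNext(Pattern) and next(Pattern) test the next token, and tokens are split on whitespace by default, so a token can never satisfy a pattern that ends in \s. Reading whole lines avoids this. The line-by-line idea above could be fleshed out roughly as follows. This is only a sketch: the class name MboxSplitter, the method splitMessages and the sample text in main are made up for illustration, and it assumes every message starts with a line beginning with "From " and that no body line does.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original question or answer.
class MboxSplitter {

    // Separator line such as "From - Thu Jul 19 07:11:55 2007".
    // matches() tests the whole line, so no ^ or $ anchors are needed.
    private static final Pattern SEPARATOR = Pattern.compile("From\\s.*");

    // Returns the text of each message, without its separator line.
    static List<String> splitMessages(BufferedReader reader) throws IOException {
        List<String> messages = new ArrayList<>();
        StringBuilder current = null; // stays null until the first separator is seen
        String line;
        while ((line = reader.readLine()) != null) {
            if (SEPARATOR.matcher(line).matches()) {
                // A new message starts here: store the previous one, if any.
                if (current != null) {
                    messages.add(current.toString());
                }
                current = new StringBuilder();
            } else if (current != null) {
                // Still inside the current message.
                current.append(line).append('\n');
            }
        }
        // Store the last message, which has no separator after it.
        if (current != null) {
            messages.add(current.toString());
        }
        return messages;
    }

    public static void main(String[] args) throws IOException {
        // Made-up sample input; a real caller would wrap a FileReader instead.
        String sample =
            "From - Thu Jul 19 07:11:55 2007\n" +
            "Received: from host-a\n\nHello\n" +
            "From - Fri Jul 20 08:00:00 2007\n" +
            "Received: from host-b\n\nBye\n";
        List<String> messages =
            splitMessages(new BufferedReader(new StringReader(sample)));
        System.out.println(messages.size() + " messages");
    }
}
```

For a 50 MB mailbox it would probably be better to process each message as soon as it is complete (for example, pass it to get_ip_address) instead of collecting them all in a list. Only the extracted IP addresses need to be kept.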

Upvotes: 1
