Reputation: 387
I have a mailbox file containing over 50 MB of messages, each separated by a line like this:
From - Thu Jul 19 07:11:55 2007
I want to build a regular expression for this in Java so I can extract the mail messages one at a time. I tried a Scanner with the following pattern as the delimiter:
public boolean ParseData(DataSource data_source) {
    boolean is_successful_transfer = false;
    String mail_header_regex = "^From\\s";
    LinkedList<String> ip_addresses = new LinkedList<String>();
    ASNRepository asn_repository = new ASNRepository();
    try {
        Pattern mail_header_pattern = Pattern.compile(mail_header_regex);
        File input_file = data_source.GetInputFile();
        //parse out each message from the mailbox
        Scanner scanner = new Scanner(input_file);
        while(scanner.hasNext(mail_header_pattern)) {
            String current_line = scanner.next(mail_header_pattern);
            Matcher mail_matcher = mail_header_pattern.matcher(current_line);
            //read each mail message and extract the proper "received from" ip address
            //to put it in our list of ip's we can add to the database to prepare
            //for querying.
            while(mail_matcher.find()) {
                String message_text = mail_matcher.group();
                String ip_address = get_ip_address(message_text);
                //empty ip address means the line contains no received from
                if(!ip_address.trim().isEmpty())
                    ip_addresses.add(ip_address);
            }
        }//next line
        //add ip addresses from mailbox to database
        is_successful_transfer = asn_repository.AddIPAddresses(ip_addresses);
    }
    //error reading file--unsuccessful transfer
    catch(FileNotFoundException ex) {
        is_successful_transfer = false;
    }
    return is_successful_transfer;
}
This seems like it should work, but whenever I run it the program hangs, probably because it never finds the pattern. The same regular expression works in Perl on the same file, but in Java it always hangs on the line String current_line = scanner.next(mail_header_pattern);
Is this regular expression correct or am I parsing the file incorrectly?
Upvotes: 0
Views: 908
Reputation: 425438
I'd lean toward something much simpler: just read the file line by line, something like this:
while(scanner.hasNextLine()) {
    String line = scanner.nextLine();
    if (line.matches("^From\\s.*")) {
        // it's a new email
    } else {
        // it's still part of the email body
    }
}
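To make that concrete, here is a minimal, self-contained sketch of the same line-by-line idea that collects each message as a string. The class name MboxSplitter and the method splitMessages are hypothetical, made up for illustration. It uses line.startsWith("From ") as a plain-string stand-in for the regex (so a tab after "From" would not count), and it assumes that body lines beginning with "From " have already been escaped by whatever wrote the mailbox.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Hypothetical helper for illustration; not part of the code in the question.
class MboxSplitter {

    // Reads the input line by line and starts a new message at every
    // line that begins with "From " (the separator line is kept as the
    // first line of the message). Lines before the first separator are dropped.
    static List<String> splitMessages(Scanner scanner) {
        List<String> messages = new ArrayList<>();
        StringBuilder current = null; // stays null until the first separator is seen
        while (scanner.hasNextLine()) {
            String line = scanner.nextLine();
            if (line.startsWith("From ")) {
                // a new email begins: store the previous one, if any
                if (current != null) {
                    messages.add(current.toString());
                }
                current = new StringBuilder();
            }
            if (current != null) {
                current.append(line).append('\n');
            }
        }
        // store the last message in the input
        if (current != null) {
            messages.add(current.toString());
        }
        return messages;
    }

    public static void main(String[] args) {
        // Sample data made up for this sketch.
        String sample =
              "From - Thu Jul 19 07:11:55 2007\n"
            + "Received: from mail.example.org\n"
            + "\n"
            + "Hello\n"
            + "From - Fri Jul 20 08:30:00 2007\n"
            + "Received: from mail.example.net\n"
            + "\n"
            + "Bye\n";
        List<String> messages = splitMessages(new Scanner(sample));
        System.out.println(messages.size() + " messages"); // prints "2 messages"
    }
}
```

For a 50 MB mailbox you would probably pass new Scanner(input_file) instead of a string, and handle each message as soon as it is complete (for example, pull out the IP address right there) rather than keeping them all in a list.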
Upvotes: 1