H Foucault

Reputation: 57

REGEX match - look ahead assertions

I am trying to parse a file and capture all the "sent" dates. The format of the dates varies, so I am looking for patterns and choosing the format (to give to Time::Piece::strptime) accordingly. My date patterns are "Mon Nov 13 12:34:10 2006" and "Tuesday, November 14, 2006 10:58 AM". I used a look-ahead assertion to check whether the line ends in AM or PM and handle the two cases. I am reading the file line by line, with the following code:

    print "$2<-\n" if $line =~ /(Sent):\s*([^\n]+)(?<=AM|PM)$/;
    print "$2<- \n" if $line =~ /(Sent):\s*([^\n]+)(?<!AM|PM)$/; 

The problem I ran into is that some lines have whitespace at the end, before the newline, like "Tuesday, November 14, 2006 10:58 AM " or "Mon Nov 13 12:34:10 2006 ". I can't figure out how to write the look-ahead so that it checks whether the line does or does not end in AM or PM, followed by a possible space. It ends up matching both times. I know I could restructure the loop (put a proper block around the first match and skip to the next line with "next" once it matches), but I really want to understand what the regex engine is doing. Also, why does $2 contain the AM or PM? Thanks.

Upvotes: 0

Views: 69

Answers (2)

Polar Bear

Reputation: 6808

Does the following code extract dates correctly?

use strict;
use warnings;

use feature 'say';

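# Capture lazily so that \s*$ absorbs any trailing whitespace.
my $pattern = qr/Sent:\s+(.*?)\s*$/;

my $date;

while( <DATA> ) {
    next if /^$/;               # skip empty lines

    $date = undef;
    $date = $1 if /$pattern/;   # $1 is only trusted after a successful match

    say "[$date]" if $date;
}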

__DATA__
Sent: Mon Nov 13 12:34:10 2006 
Sent: Fri Apr 13 12:34:10 2007 
Sent: Sat Jun 13 12:34:10 2009 
Some extra line to skip
Sent: Tuesday, November 14, 2006 10:58 AM 
Sent: Monday, November 16, 2006  6:20 AM 
Sent: Thursday, November 17, 2006  8:18 PM 
Other extra line to skip
Sent: Wednesday, December 4,  2006  1:06 PM 

Output:

[Mon Nov 13 12:34:10 2006]
[Fri Apr 13 12:34:10 2007]
[Sat Jun 13 12:34:10 2009]
[Tuesday, November 14, 2006 10:58 AM]
[Monday, November 16, 2006  6:20 AM]
[Thursday, November 17, 2006  8:18 PM]
[Wednesday, December 4,  2006  1:06 PM]
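
Since the question's goal is to hand each date to Time::Piece's strptime, the extracted strings can be mapped to a format by checking for a trailing AM or PM. What follows is only a minimal sketch: the helper name parse_sent and the two format strings are assumptions inferred from the two sample layouts in the question, not part of the code above.

```perl
use strict;
use warnings;
use Time::Piece;

# Assumed strptime formats for the two layouts shown in the question.
my $FMT_12H = '%A, %B %d, %Y %I:%M %p';   # Tuesday, November 14, 2006 10:58 AM
my $FMT_24H = '%a %b %d %H:%M:%S %Y';     # Mon Nov 13 12:34:10 2006

# Hypothetical helper: returns a Time::Piece object, or nothing if the
# line carries no "Sent:" date.
sub parse_sent {
    my ($line) = @_;

    # Same idea as the pattern above: lazy capture, trailing whitespace dropped.
    my ($date) = $line =~ /Sent:\s+(.*?)\s*$/ or return;

    # The capture has no trailing whitespace, so a plain anchored test suffices.
    my $fmt = $date =~ /[AP]M$/ ? $FMT_12H : $FMT_24H;

    return Time::Piece->strptime($date, $fmt);
}

for my $line (
    "Sent: Mon Nov 13 12:34:10 2006 \n",
    "Some extra line to skip\n",
    "Sent: Tuesday, November 14, 2006 10:58 AM \n",
) {
    my $t = parse_sent($line);
    next unless defined $t;
    print $t->datetime, "\n";   # ISO 8601 style date and time
}
```

Choosing the format after the whitespace has been stripped keeps the AM/PM test trivial, and no lookaround is needed at all.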

Upvotes: 0

ikegami

Reputation: 386396

if ( my ($sent) = $line =~ /Sent:\s*(.*)/ ) {
   print "= $sent\n" if $sent =~ /[AP]M\s*$/;
   print "! $sent\n" if $sent !~ /[AP]M\s*$/;
}
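
As for what the regex engine was doing with the question's patterns: (?<=AM|PM) and (?<!AM|PM) are lookbehind assertions, not look-aheads. They are zero-width, so they consume nothing. They only check that the text already matched by ([^\n]+) ends, or does not end, in AM or PM, which is why $2 still contains it. With a trailing space, the two characters before $ are "M " rather than "AM", so the negative lookbehind succeeds for every line. If you want to keep the lookbehind approach, one possible sketch is to force the capture to end on a non-whitespace character and let \s* absorb the rest. The name classify_sent and its labels are hypothetical, chosen only for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: returns (label, date) for a "Sent:" line,
# or an empty list if the line does not match.
sub classify_sent {
    my ($line) = @_;

    # (.*\S) ends on the last non-whitespace character, so the lookbehind
    # is tested against the end of the date, not against trailing spaces.
    return ('12-hour', $1) if $line =~ /Sent:\s*(.*\S)(?<=[AP]M)\s*$/;
    return ('24-hour', $1) if $line =~ /Sent:\s*(.*\S)(?<![AP]M)\s*$/;
    return;
}

for my $line (
    "Sent: Tuesday, November 14, 2006 10:58 AM \n",
    "Sent: Mon Nov 13 12:34:10 2006 \n",
) {
    my ($kind, $date) = classify_sent($line);
    print "$kind: [$date]\n" if defined $kind;
}
```

The two patterns are now mutually exclusive: any line that matches one cannot match the other, with or without trailing whitespace.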

Upvotes: 1
