DataTsar
DataTsar

Reputation: 41

Parse Date Column using Regex / Perl

I am writing a perl script to parse tab delimited data from standard input.

The script removes leading and trailing whitespace, blanks out any field that has the string "NULL" and re-formats date columns from "MMM DD YYYY HH:MM:SS:SSSAM" format to "YYYYMMDD" format.

Sample Input:

93092   Apr  1 2010 12:00:00:000AM      59668370.60702875
22341   Apr  1 2010 12:00:00:000AM      51309196.84639429
27844   Apr  1 2010 12:00:00:000AM      NULL
150465  Apr 22 2010 12:00:00:000AM      19706190.97586569
119364  Jul  20 2010 12:00:00:000AM      16335977.41009162

Target Output:

93092|20100401|59668370.60702875
22341|20100401|51309196.84639429
27844|20100401|
150465|20100422|19706190.97586569
119364|20100720|16335977.41009162

The script takes an argument representing the column(s) which have dates that need conversion. In the sample above, I would invoke with "1" as the param, since the 2nd column is the date that needs conversion. More than one column would be represented by a comma separated list.

This is what I've been able to do so far.

#!/usr/bin/perl
my @date_cols = split(/,/, $ARGV[0]);

while (<STDIN>) {
   my @fields = split(/\t/, $_, -1);
   for (@fields) {
      s/^\s+//;
      s/\s+\z//;
      s/^NULL\z//;
   }
   for (@fields[@date_cols]) {
##NEED HELP WITH DATE FORMATTING
   }

   print(join('|', @fields), "\n");
}

Upvotes: 2

Views: 945

Answers (2)

JRFerguson
JRFerguson

Reputation: 7516

Using Time::Piece is simple and easily provides you with the date formmating. The strptime function lets you define the pattern you want to work with; the strftime function lets you produce the output format desired. Consider:

use Time::Piece;
my $date = "Apr  1 2012 12:00:00AM";
my $t = Time::Piece->strptime($date,"%b %d %Y %H:%M:%S%p");
print $t->strftime("%Y%m%d\n");

A nice feature of this approach is that it doesn't matter whether one or two spaces separate the month and day fields; the results are the same.

Upvotes: 1

zostay
zostay

Reputation: 3995

Based on Dave Cross's suggestion above to use Time::Piece:

use Time::Piece;

while (<STDIN>) {
    # Split each row into columns by white space
    my @fields = split /\s+/;

    # Rebuild the date ("Apr 1 2010") from columns 2 through 4
    my $time_field = join ' ', @fields[1..3];

    # Parse the date - see man strptime
    my $date = Time::Piece->strptime($time_field, '%B %d %Y');

    # Format the output - see man strftime
    print join '|', $fields[0], $date->strftime('%Y%m%d'), $fields[5];
}

Regex is a really excellent tool, but dates are ugly (horrifying even). Whenever possible I prefer to use libraries that already exist for parsing them.

Upvotes: 0

Related Questions