secJ
secJ

Reputation: 509

Perl Split combined with additional REGEX

My question is two-fold:

Scenario: I am generating a CSV log file that uses a standard comma delimiter. There are currently four "columns" in the CSV file, but more could be added in the future. I am interested in limiting my REGEX Search/Replace to a specific column of data (e.g., COL 2).

The data in column two varies based on the responding device. Ultimately, I'm using REGEXs to "condition" the value to a floating number to two decimal places. Some entries come in with various artifacts (e.g. extra letters or other representations) that I convert to the appropriate floating point representation using REGEX. All of the REGEXs work well for my purposes right now, but I want to limit these REGEXs to only data in column two (COL 2), so if I add similarly formatted columns (e.g., floating point numbers) in the future they're not effected by the column 2 REGEXs.

One way I thought of doing this is using split and saving column data into into scalar variables, then running the COL2 scalar through a loop or sub-routine containing the REGEXs and then re-writing the conditioned COL2 value back out to a new CSV file. (I'm currently doing this with Perl's $^I variable.)

$^I = ".org";

while (<>) { 
  my ($col1, $col2, $col3, $col4) = (split /,/);
  $col2 =~ s/EXP1/FORMATTED/;
  $col2 =~ s/EXP2/FORMATTED/;
  $col2 =~ s/EXP3/FORMATTED/;

  my $new_rec = join ",", $col1,$col2,$col3,$col4;

  print $new_rec;
}

So Question 1: I'm wondering if this is efficient enough or if I'm doing unnecessary work? Can I add the commas in the REGEX specifically such that it only works against COL2?

Question 2: In response to a question re: split (SPLIT Question), brian d foy says the following regarding Text::CSV_XS: "It's extremely fast, which is why I said "extremely optimized". Read its docs. It also handles CSV correctly, which split doesn't."

If this is accurate, meaning split doesn't correctly handle CSV files, will my above solution work over time? I'm not sure what he means by split doesn't handle CSV correctly.

Upvotes: 1

Views: 149

Answers (2)

Kenosis
Kenosis

Reputation: 6204

Echoing what toolic mentioned, it's risky to just split CSV data, since you may erringly split a column value. Given this, here's a Text::CSV_XS option:

use strict;
use warnings;
use Text::CSV_XS;

my $csv = Text::CSV_XS->new( { binary => 1, auto_diag => 1 } )
  or die "Cannot use CSV: " . Text::CSV->error_diag();
my $sepChar = $csv->sep_char();

open my $fh, "<:encoding(utf8)", $ARGV[0] or die "$ARGV[0]: $!";
while ( my $row = $csv->getline($fh) ) {
    $row->[2] =~ s/this/that/;
    print join ',', map { /$sepChar/ ? qq{"$_"} : $_ } @$row;
}

$csv->eof or $csv->error_diag();
close $fh;

Command-line usage: >perl script.pl inFile [>outFile]

The last, optional parameter directs output to a file.

Note that you can do all of your s/// on col 2. The map includes a check for a separator character (usually a comma) in the array element. If one exists, the element's enclosed in double quotes--to maintain the CSV formatting.

Hope this helps!

Edit:

Since you're certain there are no commas in your CSV fields, you really don't need to split the data, do the substitutions, reassemble the data, and then print it. You can set up a hash whose key/value pairs are the match/substitute pairs used in substitutions. Then, just use a regex to capture the col2 value for a substitution:

use strict;
use warnings;

my %hash = ( 1 => '1.00', 'unk' => '0.00' );

while (<DATA>) {
    s/^(?:.+?,)\K([^,]+)/exists $hash{$1} ?  $hash{$1} : $1/e;
    print;
}

__DATA__
12345,1,342,789.0
47.42,unk,17.6,12
17,34,12.5,0

Output:

12345,1.00,342,789.0
47.42,0.00,17.6,12
17,34,12.5,0

Upvotes: 1

Borodin
Borodin

Reputation: 126722

It depends on where your data comes from. If you are certain that the fields will never be in quotes then split is fine, and is the best solution.

Here is how I would write your code. The for loop temporarily aliases $_ to $fields[1] for the extent of the block, and allows you to manipulate the value without expressly mentioning the variable. The output is sent to STDOUT.

use strict;
use warnings;

while (<>) {
  chomp;

  my @fields = split /,/;
  for ($fields[1]) {
    s/EXP1/FORMATTED/;
    s/EXP2/FORMATTED/;
    s/EXP3/FORMATTED/;
  }

  print join(',', @fields), "\n";
}

Upvotes: 0

Related Questions