BryanK

Reputation: 1231

Parsing CSV data using a regex in Perl

I have a CSV file where each row looks something like this:

509,,SOME VALUE,0,1,1,0.23

I am attempting to find all numbers that are two or more digits that may or may not be followed or preceded by a comma and then put them in an array by using this Perl code:

my $file ='somefile.csv';

open my $DATA , "<", $file;
$_ = do {local $/; <$DATA>};
my @A = /,?(\d{2,}),?/g;
close $DATA;

As expected, it matches the first comma-separated value in the row above, but it also matches the 23 portion of the last value, 0.23. I would expect this not to match because of the period (.).

Could someone help me with making my regex more specific so it will not find the numbers before or after the period too?

Upvotes: 0

Views: 352

Answers (1)

Borodin

Reputation: 126772

It is often unwise to press regular expressions into doing too much in a program. It is easy to end up with convoluted and incomprehensible code that could have been implemented much more simply with standard Perl.

Slurping the whole file into memory at once also makes this problem more awkward than it needs to be. Reading the file line by line is usually the best and most efficient way.

I suggest you write something like this. It reads each line, trims the newline from the end, and uses split to separate it into fields. Then grep selects the fields that match your criterion (two or more decimal digits and nothing else), and those are pushed onto the array @numbers.

use strict;
use warnings;

my $file ='somefile.csv';

open my $data, '<', $file or die "Cannot open $file: $!";
my @numbers;
while (<$data>) {
  chomp;
  push @numbers, grep /^\d{2,}$/, split /,/;
}
close $data;

print "$_\n" for @numbers;

output

509
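
To see why the original pattern also picks up 23, here is a minimal sketch that runs both approaches on the sample row. It assumes the row is held in a string rather than read from a file, and the names @loose and @strict are only illustrative.

```perl
use strict;
use warnings;

# Sample row from the question, held in a string (assumption for illustration)
my $row = '509,,SOME VALUE,0,1,1,0.23';

# Original pattern: both commas are optional, so nothing ties the
# digits to a field boundary and "23" inside "0.23" still matches
my @loose = $row =~ /,?(\d{2,}),?/g;

# split + grep: a field is kept only if it consists entirely of
# two or more digits
my @strict = grep { /^\d{2,}$/ } split /,/, $row;

print "loose:  @loose\n";    # loose:  509 23
print "strict: @strict\n";   # strict: 509
```

Because ^ and $ anchor the test to a whole field, a value such as 0.23 is rejected regardless of what surrounds it.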

If you insist on following your current plan, then this alternative program will also work. But I hope you see that it is far less clear than my first suggestion.

use strict;
use warnings;

my $file ='somefile.csv';

my $data = do {
  open my $fh, '<', $file or die "Cannot open $file: $!";
  local $/;
  <$fh>;
};

my @numbers = $data =~ /(?:,|^)\K(\d{2,})(?=,|$)/gm;
print "$_\n" for @numbers;
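
As a rough check of how that pattern behaves, the sketch below applies it to a string standing in for the slurped file. The second row is made up for illustration and is not from the question. The leading (?:,|^) requires a comma or the start of a line, \K (available from Perl 5.10) drops that prefix from the match, and the lookahead (?=,|$) requires a comma or the end of a line without consuming it, so neighbouring fields can still match.

```perl
use strict;
use warnings;

# Two rows in one string, standing in for a slurped file.
# The second row is hypothetical test data.
my $data = "509,,SOME VALUE,0,1,1,0.23\n7,1234,x99,45\n";

# /m lets ^ and $ match at every line boundary, not just the
# start and end of the whole string
my @numbers = $data =~ /(?:,|^)\K(\d{2,})(?=,|$)/gm;

print "@numbers\n";   # 509 1234 45
```

Here 23 (preceded by a period), 7 (a single digit) and 99 (preceded by x) are all skipped, while 1234 and 45 are found even though they sit in the middle or at the end of a line.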

Upvotes: 2
