Reputation: 6093

Multi-order splitting inside Perl

I have a string which comes from a CSV file:

my $str = 'NA19900,4,111629038,0;0,0;0,"GSA-rs16997168,rs16997168,rs2w34r23424",C,T,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0';

which should be translated (somehow) to

'NA19900,4,111629038,0;0,0;0,"GSA-rs16997168;rs16997168;rs2w34r23424",C,T,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0';

so that Perl's split does not break the single quoted field GSA-rs16997168,rs16997168,rs2w34r23424 into separate fields.

In other words, a comma should be replaced by a semicolon whenever it sits between the two double quotes. I can't find how to do this on Google.

What I've tried so far:

$str =~ s/"([^"]+),([^"]+)"/"$1;$2"/g; but this fails when the quoted field contains more than two comma-separated parts.
It would be great if I could somehow tell Perl's split function to treat everything within "" as one field, even if that text contains the , delimiter, but I don't know how to do that :(
I've heard of lookaheads, but I don't see how I can use them here :(

Upvotes: 0

Answers (3)

user557597

Why use a CSV module and a regex?
Just use a regex and cut out the middleman.

$str =~ s/(?m:(?:,|^)"|(?!^)\G)[^",]*\K,(?=[^"]*")/;/g ;

https://regex101.com/r/tRDCen/1

Read-me version

 (?m:
      (?: , | ^ )
      "
   |  
      (?! ^ )
      \G 
 )
 [^",]* 
 \K 
 ,
 (?= [^"]* " )
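
A minimal sketch of how the substitution above might be applied to the string from the question. It assumes Perl 5.10 or later (for \K) and that quoted fields contain no escaped quotes; the variable names $fixed, $plain and @fields are only illustrative.

```perl
use strict;
use warnings;

my $str = 'NA19900,4,111629038,0;0,0;0,"GSA-rs16997168,rs16997168,rs2w34r23424",C,T,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0';

# Work on a copy: replace each comma that sits inside a quoted field.
# \G chains every match after the first onto the end of the previous one,
# so any number of commas inside the same quotes is handled.
(my $fixed = $str) =~ s/(?m:(?:,|^)"|(?!^)\G)[^",]*\K,(?=[^"]*")/;/g;
print "$fixed\n";

# With the inner commas gone, a plain split no longer breaks the field.
# The quotes are stripped from a second copy purely for display.
(my $plain = $fixed) =~ tr/"//d;
my @fields = split /,/, $plain;
print "field 5: $fields[5]\n";
```

The first match has to start at an opening quote (either at the start of a line or right after a comma); the lookahead then only requires that a closing quote still lies ahead.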

Upvotes: 0

ikegami

Reputation: 385647

Why try to recreate a CSV parser when perfectly good ones exist?

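use Text::CSV_XS qw( );

# $fh is assumed to be a filehandle already opened on the CSV file.
my $csv = Text::CSV_XS->new({ binary => 1, auto_diag => 2 });
while ( my $row = $csv->getline($fh) ) {
   # The quoted field is the sixth column (index 5).
   $row->[5] =~ s/,/;/g;
   $csv->say(\*STDOUT, $row);
}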
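
For trying the same idea on the single string from the question without installing anything, a rough sketch follows. Note the substitution: it uses the core Text::ParseWords module instead of Text::CSV_XS, which is not what the code above does, and it is a far less complete CSV parser. It is meant only to show a quote-aware split.

```perl
use strict;
use warnings;
use Text::ParseWords qw(parse_line);   # core module, swapped in for Text::CSV_XS

my $str = 'NA19900,4,111629038,0;0,0;0,"GSA-rs16997168,rs16997168,rs2w34r23424",C,T,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0';

# parse_line(delimiter, keep, line): with keep = 0 the surrounding quotes
# are dropped and the quoted text comes back as one field.
my @fields = parse_line(',', 0, $str);

# Same edit as in the loop above: commas to semicolons in column index 5.
$fields[5] =~ s/,/;/g;
print "$fields[5]\n";
```

For real CSV files (embedded newlines, doubled quotes, encodings) a dedicated CSV parser remains the safer choice.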

Upvotes: 8

Emma

Reputation: 27723

My guess is that we wish to capture up to four commas after the last ", for which we could start with a simple expression such as:

(.*",.+?,.+?,.+?,.+?),

Demo

Test

use strict;

my $str = 'NA19900,4,111629038,0;0,0;0,"GSA-rs16997168,rs16997168,rs2w34r23424",C,T,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0';
my $regex = qr/(.*",.+?,.+?,.+?,.+?),/mp;

if ( $str =~ /$regex/g ) {
  print "Whole match is ${^MATCH} and its start/end positions can be obtained via \$-[0] and \$+[0]\n";
  # print "Capture Group 1 is $1 and its start/end positions can be obtained via \$-[1] and \$+[1]\n";
  # print "Capture Group 2 is $2 ... and so on\n";
}

# ${^POSTMATCH} and ${^PREMATCH} are also available with the use of '/p'
# Named capture groups can be called via $+{name}

RegEx

If this expression wasn't desired and you wish to modify it, please visit this link at regex101.com.

Upvotes: 1