Reputation: 999
Consider the following character strings:
"bla ; bla"; bla
"bla "";"" bla"; bla
"bla ";" bla"; bla
I'm trying to match any ; that is not in a quoted field (e.g. "bla ; bla") or in between 2 quotes.
In other words, I would like to match the second ; in the first 2 strings and all ; in the last string.
Here are the two regexes I've tried, but I can't come up with one that works in all cases.
^(['"])(?:(?!\1).)*\1(?=;)(*SKIP)(*F)|;
^(['"])(?:(?!(?!\1)\1).)*\1(?=;)(*SKIP)(*F)|;
Any idea?
EDIT
I omitted several important details in my initial question. The example lines above are from .csv files. I'm trying to extract all field separators ; in lines from different files. The problem is distinguishing between a quoted ; inside a quoted field (line 2) and two quoted fields separated by ; (line 3). A quoted field is always followed by ; in my case.
Upvotes: 5
Views: 178
Reputation: 52539
Use an actual CSV parser (well, a semicolon-separated-values parser) like Text::CSV_XS instead of trying to hack something together with regular expressions:
#!/usr/bin/env perl
use warnings;
use strict;
use feature qw/say/;
use Text::CSV_XS;
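# sep_char => ";" switches the field separator from the default comma to a semicolon.
my $csv = Text::CSV_XS->new({ binary => 1, sep_char => ";"});
# Read one record at a time from the DATA section below.
while (my $row = $csv->getline(\*DATA)) {
    say $row->[0];    # print the first field of each record
}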
__DATA__
"bla ; bla"; bla
"bla "";"" bla"; bla
"bla ";" bla"; bla
Upvotes: 5
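For readers who are not using Perl, the same parser-based idea from the Text::CSV_XS answer above can be sketched with Python's standard csv module. This is an illustrative port and not part of the original answer; the variable names lines and rows are my own. Counting fields gives the number of real separators in each line, since a row with n fields contains n - 1 of them.

```python
import csv

# The three sample lines from the question.
lines = [
    '"bla ; bla"; bla',
    '"bla "";"" bla"; bla',
    '"bla ";" bla"; bla',
]

# csv.reader accepts any iterable of strings. delimiter=';' makes it parse
# semicolon-separated values, and "" inside a quoted field is read as one ".
rows = list(csv.reader(lines, delimiter=';'))

for row in rows:
    # A row with n fields contains n - 1 field separators.
    print(len(row) - 1, row)
```

With the default dialect settings this reports one separator for each of the first two lines and two separators for the third.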
Reputation: 425208
Simplest (AFAIK shortest) and widely supported (doesn’t use SKIP, which isn’t commonly supported):
(?<!"");(?!"")(?=((?:[^"]*"){2})*[^"]*$)
See live demo.
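It works by using lookarounds to assert:
(?<!"") - the ; is not immediately preceded by ""
(?!"") - the ; is not immediately followed by ""
(?=((?:[^"]*"){2})*[^"]*$) - an even number of " chars follows the ; up to the end of the line, i.e. the ; is not inside a quoted field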
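As a quick sanity check, this pattern can be run against the three sample lines from the question. The sketch below uses Python's re module, which is my choice for illustration and not something the answer specifies; separator_positions is a hypothetical helper name.

```python
import re

# Pattern copied verbatim from the answer above.
SEPARATOR = re.compile(r'(?<!"");(?!"")(?=((?:[^"]*"){2})*[^"]*$)')

# The three sample lines from the question.
lines = [
    '"bla ; bla"; bla',
    '"bla "";"" bla"; bla',
    '"bla ";" bla"; bla',
]

def separator_positions(line):
    """Return the 0-based index of every ; that the pattern accepts."""
    return [m.start() for m in SEPARATOR.finditer(line)]

for line in lines:
    print(separator_positions(line), line)
```

This prints [11], [15] and [6, 13], i.e. the second ; in the first two lines and both ; in the third. Note that the final lookahead rescans the rest of the line for every candidate ;, so the approach may be slow on very long lines.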
Upvotes: 3
Reputation: 627103
You can use
(?:"[^"]*(?:""[^"]*)*"|'[^']*(?:''[^']*)*')(?<!;["'])(*SKIP)(*F)|;
See the regex demo. Details:
(?:"[^"]*(?:""[^"]*)*"|'[^']*(?:''[^']*)*') - a non-capturing group matching either
  "[^"]*(?:""[^"]*)*" - a ", then any zero or more chars other than a " char, then zero or more occurrences of a "" string and then any zero or more chars other than a " char, and then a " again
  | - or
  '[^']*(?:''[^']*)*' - a ', then any zero or more chars other than a ' char, then zero or more occurrences of a '' string and then any zero or more chars other than a ' char, and then a ' again
(?<!;["']) - a negative lookbehind that fails the match if there is ; and ' or " immediately to the left of the current location
(*SKIP)(*F) - fail the match and start the search for the next match from the failure position
| - or
; - a semi-colon.
Upvotes: 2
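The (*SKIP)(*F) verbs in the last pattern above are PCRE-specific, and Python's built-in re module does not support them. A minimal sketch of a common substitute, under that assumption: keep the quoted-string alternatives and the lookbehind unchanged, let them consume the quoted text, and put the ; in a capturing group so that only matches where that group participated are kept. This is a swapped-in technique, not the answer's own method, and separator_positions is a hypothetical helper name.

```python
import re

# Quoted-string sub-patterns and lookbehind copied from the answer's regex.
# Instead of (*SKIP)(*F), quoted strings are matched and then ignored, and
# only the ; alternative is captured (group 1).
TOKEN = re.compile(
    r'''(?:"[^"]*(?:""[^"]*)*"|'[^']*(?:''[^']*)*')(?<!;["'])|(;)'''
)

# The three sample lines from the question.
lines = [
    '"bla ; bla"; bla',
    '"bla "";"" bla"; bla',
    '"bla ";" bla"; bla',
]

def separator_positions(line):
    """Return the 0-based index of every ; matched outside a quoted string."""
    return [
        m.start(1)
        for m in TOKEN.finditer(line)
        if m.group(1) is not None  # skip matches that consumed a quoted string
    ]

for line in lines:
    print(separator_positions(line), line)
```

On the three sample lines this yields [11], [15] and [6, 13]. Because each quoted string is consumed in a single match, the scan moves left to right without re-reading the rest of the line.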