Junitar

Reputation: 999

Match a character only when not in a quoted field or in between 2 quotes

Consider the following character strings:

"bla ; bla"; bla
"bla "";"" bla"; bla
"bla ";" bla"; bla

I'm trying to match every ; that is neither inside a quoted field (e.g. "bla ; bla") nor enclosed between two quotes (e.g. "";"").

In other words, I would like to match the second ; in each of the first two strings and both ; in the last string.

Here are the two regexes I've tried, but neither of them works in all cases.

^(['"])(?:(?!\1).)*\1(?=;)(*SKIP)(*F)|;
^(['"])(?:(?!(?!\1)\1).)*\1(?=;)(*SKIP)(*F)|;

Any idea?

EDIT

I omitted several important details in my initial question. The example lines above come from .csv files. I'm trying to extract all field separators ; from lines in different files. My problem is distinguishing a quoted ; inside a quoted field (line 2) from two quoted fields separated by ; (line 3). In my case, a quoted field is always followed by ;.

Upvotes: 5

Views: 178

Answers (3)

Shawn

Reputation: 52539

Use an actual CSV parser (well, a semicolon-separated-values parser) such as Text::CSV_XS instead of trying to hack something together with regular expressions:

#!/usr/bin/env perl
use warnings;
use strict;
use feature qw/say/;
use Text::CSV_XS;

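# sep_char => ";" makes the parser split fields on semicolons instead of commas.
my $csv = Text::CSV_XS->new({ binary => 1, sep_char => ";" });

# Parse each line of the DATA section; $row is an array ref of its fields.
while (my $row = $csv->getline(\*DATA)) {
    say $row->[0];    # print the first field of the line
}
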
__DATA__
"bla ; bla"; bla
"bla "";"" bla"; bla
"bla ";" bla"; bla

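For readers who are not working in Perl, the same parser-based idea can be sketched with Python's standard csv module. This is only an illustration, not part of the answer above: it assumes the csv module's default quoting rules (doubled "" as an escaped quote) fit the data, and split_fields and count_separators are hypothetical helper names.

```python
import csv

LINES = [
    '"bla ; bla"; bla',
    '"bla "";"" bla"; bla',
    '"bla ";" bla"; bla',
]

def split_fields(line):
    """Parse one semicolon-separated line into a list of fields."""
    # csv.reader accepts any iterable of lines; "" inside a quoted field
    # is treated as an escaped quote by default (doublequote=True).
    reader = csv.reader([line], delimiter=";", quotechar='"')
    return next(reader)

def count_separators(line):
    """N fields on a line are joined by N - 1 real separators."""
    return len(split_fields(line)) - 1

for line in LINES:
    print(split_fields(line), count_separators(line))
```

The parser reports fields rather than separator positions, so this fits when only the field contents or the number of separators per line are needed.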
Upvotes: 5

Bohemian

Reputation: 425208

The simplest (and, as far as I know, shortest) option, which is also widely supported because it avoids (*SKIP), a verb that few regex engines implement:

(?<!"");(?!"")(?=((?:[^"]*"){2})*[^"]*$)

See live demo.

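It works by using lookarounds to assert that the ; is:

  • not wrapped in doubled quotes, i.e. neither immediately preceded nor immediately followed by ""
  • followed by an even number (including zero) of quotes up to the end of the line, i.e. not inside a quoted field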
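As a quick check of the two assertions, here is a minimal sketch that runs this regex over the three sample lines with Python's re module, which supports the lookarounds used here. The helper name separator_positions is hypothetical, and the sketch has only been reasoned through for these three lines, not for arbitrary CSV input.

```python
import re

# The regex from this answer, unchanged.
SEPARATOR = re.compile(r'(?<!"");(?!"")(?=((?:[^"]*"){2})*[^"]*$)')

LINES = [
    '"bla ; bla"; bla',
    '"bla "";"" bla"; bla',
    '"bla ";" bla"; bla',
]

def separator_positions(line):
    """Return the 0-based index of every ; the regex accepts."""
    return [m.start() for m in SEPARATOR.finditer(line)]

for line in LINES:
    print(line, separator_positions(line))
```

For the three sample lines this yields the positions [11], [15] and [6, 13], i.e. the second ; in the first two lines and both ; in the third.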

Upvotes: 3

Wiktor Stribiżew

Reputation: 627103

You can use

(?:"[^"]*(?:""[^"]*)*"|'[^']*(?:''[^']*)*')(?<!;["'])(*SKIP)(*F)|;

See the regex demo. Details:

  • (?:"[^"]*(?:""[^"]*)*"|'[^']*(?:''[^']*)*') - a non-capturing group matching either
    • "[^"]*(?:""[^"]*)*" - a ", then zero or more chars other than ", then zero or more repetitions of a "" string followed by zero or more chars other than ", and then a closing "
    • | - or
    • '[^']*(?:''[^']*)*' - a ', then zero or more chars other than ', then zero or more repetitions of a '' string followed by zero or more chars other than ', and then a closing '
  • (?<!;["']) - a negative lookbehind that fails the match if the two chars immediately to the left of the current location are ; followed by ' or "
  • (*SKIP)(*F) - fail the match and start the search for the next match from the position where the failure occurred, so the text matched so far is skipped
  • | - or
  • ; - a semi-colon.
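Engines without (*SKIP)(*F), such as Python's re module, can approximate the same idea with a different, swapped-in technique: match the quoted strings in one alternative, capture the ; in another, and keep only the matches where the capture group took part. The sketch below is an assumption-laden illustration rather than the regex above: it reuses the same sub-patterns, but the capture group, the verbose layout, and the helper name separators_outside_quotes are additions.

```python
import re

PATTERN = re.compile(r'''
    (?: "[^"]*(?:""[^"]*)*"     # double-quoted string, "" is an escaped quote
      | '[^']*(?:''[^']*)*'     # single-quoted string, '' is an escaped quote
    )
    (?<!;["'])                  # same lookbehind as in the original regex
  | (;)                         # group 1: a ; that is outside quoted strings
''', re.VERBOSE)

LINES = [
    '"bla ; bla"; bla',
    '"bla "";"" bla"; bla',
    '"bla ";" bla"; bla',
]

def separators_outside_quotes(line):
    """Return the 0-based index of every ; captured by group 1."""
    return [
        m.start(1)
        for m in PATTERN.finditer(line)
        if m.group(1) is not None  # quoted strings match with group 1 unset
    ]

for line in LINES:
    print(line, separators_outside_quotes(line))
```

Quoted strings are still consumed by the first alternative, which is what keeps the ; inside them from being reported. The filtering that (*SKIP)(*F) does inside the engine is done here in the list comprehension.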

Upvotes: 2
