dariq

Reputation: 21

How can I capture an escaped " but not an unescaped one?

Suppose the portion that needs to be captured by regex is indicated by PORTION in the following string

,"PORTION","","a",["some_string"]  

Examples of PORTION are

so the strings actually look like

PORTION is surrounded by double quotes. Double quotes inside PORTION are escaped by backslash. My current pattern is

my $pattern = '(.?([\\"]|[^"][^,][^"])*)';

which produces the results for the above examples as follows

The pattern tries to match everything in front of a sequence that is not ","
and also allow the capturing of \"
But it's not working as intended. How can I make it work?

Upvotes: 2

Views: 375

Answers (6)

outis

Reputation: 77420

Don't forget to allow for escaped backslashes along with escaped quotes. Using REs to match balanced anything gets ugly fast:

/(?<=")((?:[^"\\]+|\\+[^"\\]|(?:\\\\)+|(?<!\\)\\(?:\\\\)*")*)(?=")/

Do yourself a favor and use a parser, as Ether suggests.
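
For comparison, a shorter pattern that is often used for this job consumes each backslash together with the character it escapes, so `\\` is always swallowed as a pair and can never be mistaken for the escape of the closing quote. This is a minimal sketch, not the pattern above; the sub name `extract_portion` is hypothetical, and the sample lines are taken from inputs quoted elsewhere in this thread.

```perl
use strict;
use warnings;

# Hypothetical helper: return the contents of the first double-quoted
# field in $line, or undef when no complete quoted field is found.
sub extract_portion {
    my ($line) = @_;
    #   [^"\\]   any character that is neither a quote nor a backslash
    #   \\.      a backslash followed by the character it escapes
    return $line =~ /"((?:[^"\\]|\\.)*)"/ ? $1 : undef;
}

# Single-quoted heredoc, so the backslashes below are literal.
my @lines = split /\n/, <<'END';
,"abc\"123","","a",["some_string"]
"abc123\\","","a",["some_string"]
"abc\\\"123\"","","a",["some_string"]
END

print extract_portion($_), "\n" for @lines;
```

It still understands only this one quoting rule, so a real parser remains the safer choice for anything more involved.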

Upvotes: 1

Ether

Reputation: 53986

You're making it way too complicated; there's no rule that says you have to do all your parsing in one monolithic regex. Since your string looks like a comma-delimited sequence, first parse it as such:

my @fields = split /(?<!\\),/, $string;   # use comma as a delimiter (except when escaped)

...And then parse your first field accordingly:

shift @fields unless length $fields[0];   # pull off the potentially empty first field
$fields[0] =~ s/^"//;                # remove the leading "
$fields[0] =~ s/(?<!\\)"$//;         # remove the trailing " that isn't preceded by a \

You could parse all your fields this way by wrapping the above code in a for loop or map().
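
The map() variant might look like the sketch below. The sub name `parse_fields` is hypothetical, and it inherits the limitations of the split-based approach: it assumes there are no embedded commas and no escaped backslashes directly before a delimiter.

```perl
use strict;
use warnings;

# Hypothetical helper: split on unescaped commas, then strip the
# surrounding quotes from every field.
sub parse_fields {
    my ($string) = @_;
    my @fields = split /(?<!\\),/, $string;
    shift @fields if @fields && !length $fields[0];   # drop the empty leading field
    return map {
        my $f = $_;            # work on a copy of the field
        $f =~ s/^"//;          # leading quote
        $f =~ s/(?<!\\)"$//;   # trailing quote not preceded by a backslash
        $f;
    } @fields;
}

my @f = parse_fields(q{,"abc\"123","","a",["some_string"]});
print "$_\n" for @f;
```

Fields that are not quoted at all, such as the trailing `["some_string"]`, pass through unchanged.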

Note that this code does not account for occurrences such as \\, (the comma is a valid delimiter here because the backslash before it is itself escaped, yet the look-behind will refuse to split on it). Therefore, it is much better to use a proper parser for your format (whatever it is). You may want to take a look at Text::CSV.

Upvotes: 5

xiechao

Reputation: 2361

If you need to consider escaped backslashes as mentioned by outis, you can use this:

m/"((\\\\|\\"|[^"])+)"/

(It seems I cannot comment on outis' answer, but outis' solution does not work with this input:

"abc\\\"123"

will produce

abc\\\

)

Input:

,"\"abc123","","a",["some_string"]
,"abc123\" ","","a",["some_string"]
"\"abc123\"","","a",["some_string"]
"abc\"123\"","","a",["some_string"]
"abc123","","a",["some_string"]
"ab\\c123","","a",["some_string"]
"abc123\\","","a",["some_string"]
"abc123\\\"","","a",["some_string"]
"abc\\\"123\"","","a",["some_string"]
"abc123\\\\\"","","a",["some_string"]

Output:

\"abc123
abc123\" 
\"abc123\"
abc\"123\"
abc123
ab\\c123
abc123\\
abc123\\\"
abc\\\"123\"
abc123\\\\\"

Upvotes: 0

Just use Text::CSV

Upvotes: 3

Carl Smotricz

Reputation: 67790

Your problem calls for the infamous zero-width negative look-behind assertion

...which lets you match a foo that doesn't follow a bar.

The doc is here: http://perldoc.perl.org/perlre.html#Extended-Patterns

and you want something like this in your regexp:

"(.+?)(?<!\\)"

This matches a double quote, then as few characters as possible (but at least one), then another double quote that is not preceded by a backslash (the backslash is doubled in the pattern to escape it, I think). The first set of parentheses captures as you intend; the second set is the look-behind assertion and does not capture.

Edit: Meanwhile tested using http://www.internetofficer.com/seo-tool/regex-tester/ and it seems to work fine.

Edit: As outis points out, this expression will not correctly match a PORTION in which the final character before the closing quote is an escaped backslash. If you don't anticipate backslashes in your text you should be fine though.
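
To make that limitation concrete, here is a small sketch that runs the pattern on one line with an escaped quote and one line that ends in an escaped backslash. The sub name `lookbehind_capture` is hypothetical; the sample lines come from inputs quoted elsewhere in this thread.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the look-behind pattern from this answer.
sub lookbehind_capture {
    my ($line) = @_;
    return $line =~ /"(.+?)(?<!\\)"/ ? $1 : undef;
}

# Single-quoted heredoc, so the backslashes below are literal.
my @lines = split /\n/, <<'END';
,"abc\"123","","a",["some_string"]
"abc123\\","","a",["some_string"]
END

# First line:  captures abc\"123 as intended.
# Second line: the closing quote follows a backslash, so it is skipped
#              and the capture runs on into the next field.
print lookbehind_capture($_), "\n" for @lines;
```

On the second line the capture is `abc123\\",` rather than `abc123\\`, because the look-behind cannot tell that the backslash before the closing quote is itself escaped.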

Upvotes: 1

ghostdog74

Reputation: 342659

If your data is comma-delimited and does not have embedded commas, just split on "," and take the appropriate field:

while(<>){
    chomp;
    @s = split /,/;
    if ($s[0] eq ""){
        print "$s[1]\n";
    }else{
        print $s[0]."\n";
    }
}

output

$ perl perl.pl file
"\"abc123"
"abc123\" "
"\"abc123\""
"abc\"123\""
"abc123"

Upvotes: 0
