Reputation: 21
Suppose the portion that the regex needs to capture is marked as PORTION in the following string:
,"PORTION","","a",["some_string"]
Examples of PORTION are
so the strings actually look like
PORTION is surrounded by double quotes. Double quotes inside PORTION are escaped by backslash. My current pattern is
my $pattern = '(.?([\\"]|[^"][^,][^"])*)';
which produces the results for the above examples as follows
The pattern tries to match everything in front of a sequence that is not ","
and also allow the capturing of \"
But it's not working as intended.
How can I make it work?
Upvotes: 2
Views: 375
Reputation: 77420
Don't forget to allow for escaped backslashes along with escaped quotes. Using REs to match balanced anything gets ugly fast:
/(?<=")((?:[^"\\]+|\\+[^"\\]|(?:\\\\)+|(?<!\\)\\(?:\\\\)*")*)(?=")/
Do yourself a favor and use a parser, as Ether suggests.
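As a rough illustration of the parser route (a hand-rolled sketch, not code from any answer here or from a library), a small scanner can walk the string and treat a backslash plus the character after it as one unit. The helper name first_quoted and the sample line are made up for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: return the contents of the first double-quoted
# string in $s, keeping escape sequences verbatim; undef if none is found.
sub first_quoted {
    my ($s) = @_;
    my $start = index($s, '"');
    return undef if $start < 0;          # no opening quote at all

    my $out = '';
    my $i   = $start + 1;
    while ($i < length $s) {
        my $c = substr($s, $i, 1);
        if ($c eq '\\' && $i + 1 < length $s) {
            $out .= substr($s, $i, 2);   # backslash plus the escaped character
            $i += 2;
            next;
        }
        return $out if $c eq '"';        # unescaped quote closes the string
        $out .= $c;
        $i++;
    }
    return undef;                        # ran off the end without a closing quote
}

# Sample line in the question's format (assumed, not taken from real data).
my $portion = first_quoted(',"abc\"123","","a",["some_string"]');
print "$portion\n";
```

Because every backslash is consumed together with its partner, a run of escaped backslashes right before the closing quote does not confuse the scanner.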
Upvotes: 1
Reputation: 53986
You're making it way too complicated; there's no rule that says you have to do all your parsing in one monolithic regex. Since your string looks like a comma-delimited sequence, first parse it as such:
my @fields = split /(?<!\\),/, $string; # use comma as a delimiter (except when escaped)
...And then parse your first field accordingly:
shift @fields unless $fields[0]; # pull off the potentially null first field
$fields[0] =~ s/^"//g; # remove the leading "
$fields[0] =~ s/(?<!\\)"$//g; # remove the trailing " that isn't preceded by a \
You could parse all your fields this way by wrapping the above code in a for loop or map().
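A minimal sketch of the map() variant might look like the following. The sample line and the @clean array are assumptions made for illustration, and the same limitation around escaped backslashes applies.

```perl
use strict;
use warnings;

# Hypothetical sample line in the question's format.
my $string = ',"abc\"123","","a",["some_string"]';

# Split on commas that are not preceded by a backslash.
my @fields = split /(?<!\\),/, $string;

# Drop the empty first field produced by the leading comma.
shift @fields if @fields && $fields[0] eq '';

# Strip one leading quote and one trailing unescaped quote from each field.
my @clean = map {
    my $f = $_;              # copy so the original field is left untouched
    $f =~ s/^"//;
    $f =~ s/(?<!\\)"$//;
    $f;
} @fields;

print "$_\n" for @clean;
```

Fields that are not quoted, such as the bracketed last one, pass through unchanged.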
Note that this code does not account for occurrences such as \\, (the comma is a valid delimiter here because the backslash before it is itself escaped, but the look-behind will wrongly treat the comma as escaped). Therefore, it would be much better to use a proper parser for your format (whatever it is). You may want to take a look at Text::CSV.
Upvotes: 5
Reputation: 2361
If you need to consider escaped backslashes as mentioned by outis, you can use this:
m/"((\\\\|\\"|[^"])+)"/
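As a quick way to try this pattern, the sketch below applies it to three of the inputs listed in this answer. The single-quoted heredoc is only there so the backslashes reach the regex untouched; the array names are made up for the example.

```perl
use strict;
use warnings;

# A quoted heredoc terminator ('END') disables all escape processing,
# so every backslash below is literal.
my @lines = split /\n/, <<'END';
,"\"abc123","","a",["some_string"]
"abc123\\","","a",["some_string"]
"abc\\\"123\"","","a",["some_string"]
END

my @portions;
for my $line (@lines) {
    # Alternation order matters: try an escaped backslash first,
    # then an escaped quote, then any character that is not a quote.
    push @portions, $1 if $line =~ m/"((\\\\|\\"|[^"])+)"/;
}

print "$_\n" for @portions;
```

Note that the + quantifier means an empty PORTION ("") would not be matched by this pattern.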
(It seems I cannot comment on outis' answer, but outis' solution does not work with this:
"abc\\\"123"
will produce
abc\\\
)
Input:
,"\"abc123","","a",["some_string"]
,"abc123\" ","","a",["some_string"]
"\"abc123\"","","a",["some_string"]
"abc\"123\"","","a",["some_string"]
"abc123","","a",["some_string"]
"ab\\c123","","a",["some_string"]
"abc123\\","","a",["some_string"]
"abc123\\\"","","a",["some_string"]
"abc\\\"123\"","","a",["some_string"]
"abc123\\\\\"","","a",["some_string"]
Output:
\"abc123
abc123\" 
\"abc123\"
abc\"123\"
abc123
ab\\c123
abc123\\
abc123\\\"
abc\\\"123\"
abc123\\\\\"
Upvotes: 0
Reputation: 67790
Your problem calls for the infamous zero-width negative look-behind assertion
...which lets you match a foo that doesn't follow a bar.
The doc is here: http://perldoc.perl.org/perlre.html#Extended-Patterns
and you want something like this in your regexp:
"(.+?)(?<!\\)"
that matches a double quote, as few as possible of any char(s), then another double quote not preceded by a backslash (escaped by doubling, I think). The first set of parens captures as you intend, the second parentheses are not capturing.
Edit: Meanwhile tested using http://www.internetofficer.com/seo-tool/regex-tester/ and it seems to work fine.
Edit: As outis points out, this expression will not correctly match a PORTION in which the final character before the closing quote is an escaped backslash. If you don't anticipate backslashes in your text you should be fine though.
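To make that limitation concrete, here is a small sketch comparing the look-behind pattern with an alternative that consumes each backslash together with the character it escapes. The alternative pattern is a commonly used idiom and not part of this answer, and the input line is an assumed example.

```perl
use strict;
use warnings;

# Hypothetical input whose PORTION ends in an escaped backslash.
my $line = <<'END';
"abc123\\","","a",["some_string"]
END
chomp $line;

# Look-behind version: the real closing quote is rejected because a
# backslash precedes it, so the match runs on to a later quote.
my ($lazy) = $line =~ /"(.+?)(?<!\\)"/;

# Alternative: match either a plain character or a backslash plus one
# following character, so escaped backslashes are paired off correctly.
my ($paired) = $line =~ /"((?:[^"\\]|\\.)*)"/;

print "look-behind: $lazy\n";
print "paired:      $paired\n";
```

On this input the look-behind pattern captures past the intended closing quote, while the paired version stops at abc123\\.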
Upvotes: 1
Reputation: 342659
If your data is comma-delimited and has no embedded commas, just split on "," and take the appropriate field:
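while (<>) {
    chomp;
    my @s = split /,/;        # naive split: assumes no commas inside fields
    if ($s[0] eq "") {        # line starts with a comma, so PORTION is the second field
        print "$s[1]\n";
    } else {
        print "$s[0]\n";
    }
}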
output
$ perl perl.pl file
"\"abc123"
"abc123\" "
"\"abc123\""
"abc\"123\""
"abc123"
Upvotes: 0