Reputation: 2604
My search text is as follows.
...
...
var strings = ["aaa","bbb","ccc","ddd","eee"];
...
...
It contains many lines(actually a javascript file) but need to parse the values in variable strings , ie aaa , bbb, ccc , ddd , eee
Following is the Perl code, or use PHP at bottom
my $str = <<STR;
...
...
var strings = ["aaa","bbb","ccc","ddd","eee"];
...
...
STR
my @matches = $str =~ /(?:\"(.+?)\",?)/g;
print "@matches";
I know the above script will match all instants, but it will parse strings ("xyz") in the other lines also. So I need to check the string var strings =
/var strings = \[(?:\"(.+?)\",?)/g
Using above regex it will parse aaa.
/var strings = \[(?:\"(.+?)\",?)(?:\"(.+?)\",?)/g
Using above, will get aaa , and bbb. So to avoid the regex repeating I used '+' quantifier as below.
/var strings = \[(?:\"(.+?)\",?)+/g
But I got only eee, So my question is why I got eee ONLY when I used '+' quantifier?
Update 1: Using PHP preg_match_all (doing it to get more attention :-) )
$str = <<<STR
...
...
var strings = ["aaa","bbb","ccc","ddd","eee"];
...
...
STR;
preg_match_all("/var strings = \[(?:\"(.+?)\",?)+/",$str,$matches);
print_r($matches);
Update 2: Why it matched eee ? Because of the greediness of (?:\"(.+?)\",?)+
. By removing greediness /var strings = \[(?:\"(.+?)\",?)+?/
aaa will be matched. But why only one result? Is there any way it can be achieved by using single regex?
Upvotes: 2
Views: 422
Reputation: 126722
You may prefer this solution which first looks for the string var strings = [
using the /g
modifier. This sets \G
to match immediately after the [
for the next regex, which looks for all immediately following occurrences of double-quoted strings, possibly preceded by commas or whitespace.
my @matches;
if ($str =~ /var \s+ strings \s* = \s* \[ /gx) {
@matches = $str =~ /\G [,\s]* "([^"]+)" /gx;
}
Despite using the /g
modifier your regex /var strings = \[(?:\"(.+?)\",?)+/g
matches only once because there is no second occurrence of var strings = [
. Each match returns a list of the values of the capture variables $1
, $2
, $3
etc. when the match completed, and /(?:"(.+?)",?)+/
(there is no need to escape the double-quotes) captures multiple values into $1
leaving only the final value there. You need to write something like the above , which captures only a single value into $1
for each match.
Upvotes: 2
Reputation: 75222
Here's a single-regex solution:
/(?:\bvar\s+strings\s*=\s*\[|\G,)\s*"([^"]*)"/g
\G
is a zero-width assertion that matches the position where the previous match ended (or the beginning of the string if it's the first match attempt). So this acts like:
var\s+strings\s*=\s*[\s*"([^"]*)"
...on the first attempt, then:
,\s*"([^"]*)"
...after that, but each match has to start exactly where the last one left off.
Here's a demo in PHP, but it will work in Perl, too.
Upvotes: 2
Reputation: 54323
Because the +
tells it to repeat the exact stuff inside brackets (?:"(.+?)",?)
one or more times. So it will match the "eee"
string, end then look for repetitions of that "eee"
string, which it does not find.
use YAPE::Regex::Explain;
print YAPE::Regex::Explain->new(qr/var strings = \[(?:"(.+?)",?)+/)->explain();
The regular expression:
(?-imsx:var strings = \[(?:"(.+?)",?)+)
matches as follows:
NODE EXPLANATION
----------------------------------------------------------------------
(?-imsx: group, but do not capture (case-sensitive)
(with ^ and $ matching normally) (with . not
matching \n) (matching whitespace and #
normally):
----------------------------------------------------------------------
var strings = 'var strings = '
----------------------------------------------------------------------
\[ '['
----------------------------------------------------------------------
(?: group, but do not capture (1 or more times
(matching the most amount possible)):
----------------------------------------------------------------------
" '"'
----------------------------------------------------------------------
( group and capture to \1:
----------------------------------------------------------------------
.+? any character except \n (1 or more
times (matching the least amount
possible))
----------------------------------------------------------------------
) end of \1
----------------------------------------------------------------------
" '"'
----------------------------------------------------------------------
,? ',' (optional (matching the most amount
possible))
----------------------------------------------------------------------
)+ end of grouping
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
A simpler example would be:
my @m = ('abcd' =~ m/(\w)+/g);
print "@m";
Prints only d
. This is due to:
use YAPE::Regex::Explain;
print YAPE::Regex::Explain->new(qr/(\w)+/)->explain();
The regular expression:
(?-imsx:(\w)+)
matches as follows:
NODE EXPLANATION
----------------------------------------------------------------------
(?-imsx: group, but do not capture (case-sensitive)
(with ^ and $ matching normally) (with . not
matching \n) (matching whitespace and #
normally):
----------------------------------------------------------------------
( group and capture to \1 (1 or more times
(matching the most amount possible)):
----------------------------------------------------------------------
\w word characters (a-z, A-Z, 0-9, _)
----------------------------------------------------------------------
)+ end of \1 (NOTE: because you are using a
quantifier on this capture, only the LAST
repetition of the captured pattern will be
stored in \1)
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
If you use the quantifier on the capture group, only the last instance will be used.
Here's a way that works:
my $str = <<STR;
...
...
var strings = ["aaa","bbb","ccc","ddd","eee"];
...
...
STR
my @matches;
$str =~ m/var strings = \[(.+?)\]/; # get the array first
my $jsarray = $1;
@matches = $array =~ m/"(.+?)"/g; # and get the strings from that
print "@matches";
Update: A single-line solution (though not a single regex) would be:
@matches = ($str =~ m/var strings = \[(.+?)\]/)[0] =~ m/"(.+?)"/g;
But this is highly unreadable imho.
Upvotes: 1