Jithin
Jithin

Reputation: 2604

Match all occurrences of a string

My search text is as follows.

...
...
var strings = ["aaa","bbb","ccc","ddd","eee"];
...
...

It contains many lines(actually a javascript file) but need to parse the values in variable strings , ie aaa , bbb, ccc , ddd , eee

Following is the Perl code, or use PHP at bottom

my $str = <<STR;
    ...
    ...
    var strings = ["aaa","bbb","ccc","ddd","eee"];
    ...
    ...
STR

my @matches = $str =~ /(?:\"(.+?)\",?)/g;
print "@matches";

I know the above script will match all instants, but it will parse strings ("xyz") in the other lines also. So I need to check the string var strings =

/var strings = \[(?:\"(.+?)\",?)/g

Using above regex it will parse aaa.

/var strings = \[(?:\"(.+?)\",?)(?:\"(.+?)\",?)/g

Using above, will get aaa , and bbb. So to avoid the regex repeating I used '+' quantifier as below.

/var strings = \[(?:\"(.+?)\",?)+/g

But I got only eee, So my question is why I got eee ONLY when I used '+' quantifier?

Update 1: Using PHP preg_match_all (doing it to get more attention :-) )

$str = <<<STR
    ...
    ...
    var strings = ["aaa","bbb","ccc","ddd","eee"];
    ...
    ...
STR;

preg_match_all("/var strings = \[(?:\"(.+?)\",?)+/",$str,$matches);
print_r($matches);

Update 2: Why it matched eee ? Because of the greediness of (?:\"(.+?)\",?)+ . By removing greediness /var strings = \[(?:\"(.+?)\",?)+?/ aaa will be matched. But why only one result? Is there any way it can be achieved by using single regex?

Upvotes: 2

Views: 422

Answers (3)

Borodin
Borodin

Reputation: 126722

You may prefer this solution which first looks for the string var strings = [ using the /g modifier. This sets \G to match immediately after the [ for the next regex, which looks for all immediately following occurrences of double-quoted strings, possibly preceded by commas or whitespace.

my @matches;

if ($str =~ /var \s+ strings \s* = \s* \[ /gx) {
  @matches = $str =~ /\G [,\s]* "([^"]+)" /gx;
}

Despite using the /g modifier your regex /var strings = \[(?:\"(.+?)\",?)+/g matches only once because there is no second occurrence of var strings = [. Each match returns a list of the values of the capture variables $1, $2, $3 etc. when the match completed, and /(?:"(.+?)",?)+/ (there is no need to escape the double-quotes) captures multiple values into $1 leaving only the final value there. You need to write something like the above , which captures only a single value into $1 for each match.

Upvotes: 2

Alan Moore
Alan Moore

Reputation: 75222

Here's a single-regex solution:

/(?:\bvar\s+strings\s*=\s*\[|\G,)\s*"([^"]*)"/g

\G is a zero-width assertion that matches the position where the previous match ended (or the beginning of the string if it's the first match attempt). So this acts like:

var\s+strings\s*=\s*[\s*"([^"]*)"

...on the first attempt, then:

,\s*"([^"]*)"

...after that, but each match has to start exactly where the last one left off.

Here's a demo in PHP, but it will work in Perl, too.

Upvotes: 2

simbabque
simbabque

Reputation: 54323

Because the + tells it to repeat the exact stuff inside brackets (?:"(.+?)",?) one or more times. So it will match the "eee" string, end then look for repetitions of that "eee" string, which it does not find.

use YAPE::Regex::Explain;
print YAPE::Regex::Explain->new(qr/var strings = \[(?:"(.+?)",?)+/)->explain();

The regular expression:

(?-imsx:var strings = \[(?:"(.+?)",?)+)

matches as follows:

NODE                     EXPLANATION
----------------------------------------------------------------------
(?-imsx:                 group, but do not capture (case-sensitive)
                         (with ^ and $ matching normally) (with . not
                         matching \n) (matching whitespace and #
                         normally):
----------------------------------------------------------------------
  var strings =            'var strings = '
----------------------------------------------------------------------
  \[                       '['
----------------------------------------------------------------------
  (?:                      group, but do not capture (1 or more times
                           (matching the most amount possible)):
----------------------------------------------------------------------
    "                        '"'
----------------------------------------------------------------------
    (                        group and capture to \1:
----------------------------------------------------------------------
      .+?                      any character except \n (1 or more
                               times (matching the least amount
                               possible))
----------------------------------------------------------------------
    )                        end of \1
----------------------------------------------------------------------
    "                        '"'
----------------------------------------------------------------------
    ,?                       ',' (optional (matching the most amount
                             possible))
----------------------------------------------------------------------
  )+                       end of grouping
----------------------------------------------------------------------
)                        end of grouping
----------------------------------------------------------------------

A simpler example would be:

my @m = ('abcd' =~ m/(\w)+/g);
print "@m";

Prints only d. This is due to:

use YAPE::Regex::Explain;
print YAPE::Regex::Explain->new(qr/(\w)+/)->explain();

The regular expression:

(?-imsx:(\w)+)

matches as follows:

NODE                     EXPLANATION
----------------------------------------------------------------------
(?-imsx:                 group, but do not capture (case-sensitive)
                         (with ^ and $ matching normally) (with . not
                         matching \n) (matching whitespace and #
                         normally):
----------------------------------------------------------------------
  (                        group and capture to \1 (1 or more times
                           (matching the most amount possible)):
----------------------------------------------------------------------
    \w                       word characters (a-z, A-Z, 0-9, _)
----------------------------------------------------------------------
  )+                       end of \1 (NOTE: because you are using a
                           quantifier on this capture, only the LAST
                           repetition of the captured pattern will be
                           stored in \1)
----------------------------------------------------------------------
)                        end of grouping
----------------------------------------------------------------------

If you use the quantifier on the capture group, only the last instance will be used.


Here's a way that works:

my $str = <<STR;
    ...
    ...
    var strings = ["aaa","bbb","ccc","ddd","eee"];
    ...
    ...
STR

my @matches;
$str =~ m/var strings = \[(.+?)\]/; # get the array first
my $jsarray = $1;
@matches = $array =~ m/"(.+?)"/g; # and get the strings from that

print "@matches";

Update: A single-line solution (though not a single regex) would be:

@matches = ($str =~ m/var strings = \[(.+?)\]/)[0] =~ m/"(.+?)"/g;

But this is highly unreadable imho.

Upvotes: 1

Related Questions