Antonio Lam
Antonio Lam

Reputation: 13

Perl pattern matching with optional tokens

I have a string like this:

$words = "[a] (good|bad) word [for fun]";

where:

  1. Everything inside [] is optional
  2. and Value inside (..|..) is OR mandatory value

So, the possible outcomes from the above string would be like these:

a good word for fun

a bad word for fun

a good word

a bad Word

good word for fun 

bad word for fun

good word 

bad word 

Can someone help me to find a way to extract all the possible outcome (like the example above) and store them in an array?

Thanks!

Upvotes: 0

Views: 209

Answers (3)

perreal
perreal

Reputation: 98048

use warnings;
use strict;
use constant { OPT => 0, OR => 1, FIXED => 2 };

my $words = "[a] (good|bad) word [for fun]";
my @tokens;
# parse input
my @v = grep {$_} split /(\[|\]|\(|\||\))/, $words;
while (my $token = shift @v) {
  if ($token eq '[') {
    push @tokens, [ OPT, shift @v ];
    shift @v; # ]
  } elsif ($token eq '(') {
    my @list;
    do {
      push (@list, [ FIXED, shift @v] );
    } until (shift @v eq ')'); # '|,)'
    push @tokens, [ OR, \@list ];
  }
  else {
    push @tokens, [FIXED, $token];
  }
}
# generate output
my @phrases = ("");
for my $token (@tokens) {
  my @additions;
  if ($token->[0] == OPT) {
    push @additions, $_.$token->[1] for @phrases;
  } elsif ($token->[0] == FIXED) {
    $_ .= $token->[1] for @phrases;
  } elsif ($token->[0] == OR) {
    foreach my $list (@{$token->[1]}) {
      push @additions, $_.$list->[1] for @phrases;
    }   
    @phrases = (); 
  }
  push @phrases, @additions;
}


print "$_\n" for map {s/^\s+//;s/[ ]+/ /g;$_} @phrases;

Upvotes: 2

Sinan Ünür
Sinan Ünür

Reputation: 118148

I saw this as an opportunity to try using Parse::RecDescent. I don't understand these things very well, so there might have been a better way to write the grammar.

The parser allows me to generate a list of sets of phrases to use. Then, I feed that list of sets to Set::CrossProduct to generate the Cartesian product of sets.

#!/usr/bin/env perl

use strict;
use warnings;

use Parse::RecDescent;
use Set::CrossProduct;

our @list;

my $parser = Parse::RecDescent->new(q{
    List: OptionalPhrase |
          AlternatingMandatoryPhrases |
          FixedPhrase

    OptionalPhrase:
        OptionalPhraseStart
        OptionalPhraseContent
        OptionalPhraseEnd

    OptionalPhraseStart: /\\[/

    OptionalPhraseContent: /[^\\]]+/
        {
            push @::list, [ $item[-1], '' ];
        }

    OptionalPhraseEnd: /\\]/

    AlternatingMandatoryPhrases:
        AlternatingMandatoryPhrasesStart
        AlternatingMandatoryPhrasesContent
        AlternatingMandatoryPhraseEnd

    AlternatingMandatoryPhrasesStart: /\\(/

    AlternatingMandatoryPhrasesContent: /[^|)]+(?:[|][^|)]+)*/
        {
            push @::list, [ split /[|]/, $item[-1] ];
        }

    AlternatingMandatoryPhraseEnd: /\\)/

    FixedPhrase: /[^\\[\\]()]+/
        {
            $item[-1] =~ s/\\A\\s+//;
            $item[-1] =~ s/\s+\z//;
            push @::list, [ $item[-1] ];
        }
});

my $words = "[a] (good|bad) word [for fun]";

1 while defined $parser->List(\$words);

my $iterator = Set::CrossProduct->new(\@list);

while (my $next = $iterator->get) {
    print join(' ', grep length, @$next), "\n";
}

Output:

a good word for fun
a good word
a bad word for fun
a bad word
good word for fun
good word
bad word for fun
bad word

Upvotes: 1

C. Ramseyer
C. Ramseyer

Reputation: 2382

With regular expressions, you can determine if "bad word" matches your pattern "[a] (good|bad) word [for fun]" (which, as regex match, would probably be spelled as /(a )?(good|bad) word( for fun)?/). But it sounds like you actually want to do the inverse, ie. generate all possible inputs from your pattern. This is not something regexes can do.

What you should be looking at is called permutations. Your template string has these parts:

  1. "a " or nothing
  2. "good" or "bad"
  3. " word"
  4. " for fun" or nothing

So there are two possibilities for fragments 1 and 2, just one for fragment three, and again two for fragment 4, giving you 2 * 2 * 1 * 2 = 8 possiblities.

Just store all these possiblities in a multi-dimensional array, e.g.

my $sentence = [["a ", ""], ["good", "bad"], ["word"], ["for fun", ""]];

Then look up permutation algorithms or permutation modules on CPAN to find all the combinations.

As an example for a single permuation, "bad word" would be represented as:

 my $badword = 
    $sentence->[0]->[0] 
  . $sentence->[1]->[1] 
  . $sentence->[2]->[0] 
  . $sentence->[3]->[0];

Upvotes: 1

Related Questions