Janus Bahs Jacquet
Janus Bahs Jacquet

Reputation: 920

Generate all possible matches for regex pattern in PHP

There are quite a few questions on SO asking about how to parse a regex pattern and output all possible matches to that pattern. For some reason, though, every single one of them I can find (1, 2, 3, 4, 5, 6, 7, probably more) are either for Java or some variety of C (and just one for JavaScript), and I currently need to do this in PHP.

I’ve Googled to my heart’s (dis)content, but whatever I do, pretty much the only thing that Google gives me is links to the docs for preg_match() and pages about how to use regex, which is the opposite of what I want here.

My regex patterns are all very simple and guaranteed to be finite; the only syntax used is:

So an example might be [ct]hun(k|der)(s|ed|ing)? to match all possible forms of the verbs chunk, thunk, chunder and thunder, for a total of sixteen permutations.

Ideally, there’d be a library or tool for PHP which will iterate through (finite) regex patterns and output all possible matches, all ready to go. Does anyone know if such a library/tool already exists?

If not, what is an optimised way to approach making one? This answer for JavaScript is the closest I’ve been able to find to something I should be able to adapt, but unfortunately I just can’t wrap my head around how it actually works, which makes adapting it more tricky. Plus there may well be better ways of doing it in PHP anyway. Some logical pointers as to how the task would best be broken down would be greatly appreciated.

Edit: Since apparently it wasn’t clear how this would look in practice, I am looking for something that will allow this type of input:

$possibleMatches = parseRegexPattern('[ct]hun(k|der)(s|ed|ing)?');

– and printing $possibleMatches should then give something like this (the order of the elements is not important in my case):

Array
(
    [0] => chunk
    [1] => thunk
    [2] => chunks
    [3] => thunks
    [4] => chunked
    [5] => thunked
    [6] => chunking
    [7] => thunking
    [8] => chunder
    [9] => thunder
    [10] => chunders
    [11] => thunders
    [12] => chundered
    [13] => thundered
    [14] => chundering
    [15] => thundering
)

Upvotes: 1

Views: 1023

Answers (1)

Steven
Steven

Reputation: 6148

Method

  1. You need to strip out the variable patterns; you can use preg_match_all to do this

    preg_match_all("/(\[\w+\]|\([\w|]+\))/", '[ct]hun(k|der)(s|ed|ing)?', $matches);
    
    /* Regex:
    
    /(\[\w+\]|\([\w|]+\))/
    /                       : Pattern delimiter
     (                      : Start of capture group
      \[\w+\]               : Character class pattern
             |              : OR operator
              \([\w|]+\)    : Capture group pattern
                        )   : End of capture group
                         /  : Pattern delimiter
    
    */
    
  2. You can then expand the capture groups to letters or words (depending on type)

    $array = str_split($cleanString, 1); // For a character class
    $array = explode("|", $cleanString); // For a capture group
    
  3. Recursively work your way through each $array

Code

function printMatches($pattern, $array, $matchPattern)
{
    $currentArray = array_shift($array);

    foreach ($currentArray as $option) {
        $patternModified = preg_replace($matchPattern, $option, $pattern, 1);
        if (!count($array)) {
            echo $patternModified, PHP_EOL;
        } else {
            printMatches($patternModified, $array, $matchPattern);
        }
    }
}

function prepOptions($matches)
{
    foreach ($matches as $match) {
        $cleanString = preg_replace("/[\[\]\(\)\?]/", "", $match);
        
        if ($match[0] === "[") {
            $array = str_split($cleanString, 1);
        } elseif ($match[0] === "(") {
            $array = explode("|", $cleanString);
        }
        if ($match[-1] === "?") {
            $array[] = "";
        }
        $possibilites[] = $array;
    }
    return $possibilites;
}

$regex        = '[ct]hun(k|der)(s|ed|ing)?';
$matchPattern = "/(\[\w+\]|\([\w|]+\))\??/";

preg_match_all($matchPattern, $regex, $matches);

printMatches(
    $regex,
    prepOptions($matches[0]),
    $matchPattern
);

Additional functionality

Expanding nested groups

In use you would put this before the "preg_match_all".

$regex        = 'This happen(s|ed) to (be(come)?|hav(e|ing)) test case 1?';

echo preg_replace_callback("/(\(|\|)(\w+)(?:\(([\w\|]+)\)\??)/", function($array){
    $output = explode("|", $array[3]);
    if ($array[0][-1] === "?") {
        $output[] = "";
    }
    foreach ($output as &$option) {
        $option = $array[2] . $option;
    }
    return $array[1] . implode("|", $output);
}, $regex), PHP_EOL;

Output:

This happen(s|ed) to (become|be|have|having) test case 1?

Matching single letters

The bones of this would be to update the regex:

$matchPattern = "/(?:(\[\w+\]|\([\w|]+\))\??|(\w\?))/";

and add an else to the prepOptions function:

} else {
    $array = [$cleanString];
}

Full working example

function printMatches($pattern, $array, $matchPattern)
{
    $currentArray = array_shift($array);

    foreach ($currentArray as $option) {
        $patternModified = preg_replace($matchPattern, $option, $pattern, 1);
        if (!count($array)) {
            echo $patternModified, PHP_EOL;
        } else {
            printMatches($patternModified, $array, $matchPattern);
        }
    }
}

function prepOptions($matches)
{
    foreach ($matches as $match) {
        $cleanString = preg_replace("/[\[\]\(\)\?]/", "", $match);
        
        if ($match[0] === "[") {
            $array = str_split($cleanString, 1);
        } elseif ($match[0] === "(") {
            $array = explode("|", $cleanString);
        } else {
            $array = [$cleanString];
        }
        if ($match[-1] === "?") {
            $array[] = "";
        }
        $possibilites[] = $array;
    }
    return $possibilites;
}

$regex        = 'This happen(s|ed) to (be(come)?|hav(e|ing)) test case 1?';
$matchPattern = "/(?:(\[\w+\]|\([\w|]+\))\??|(\w\?))/";

$regex = preg_replace_callback("/(\(|\|)(\w+)(?:\(([\w\|]+)\)\??)/", function($array){
    $output = explode("|", $array[3]);
    if ($array[0][-1] === "?") {
        $output[] = "";
    }
    foreach ($output as &$option) {
        $option = $array[2] . $option;
    }
    return $array[1] . implode("|", $output);
}, $regex);


preg_match_all($matchPattern, $regex, $matches);

printMatches(
    $regex,
    prepOptions($matches[0]),
    $matchPattern
);

Output:

This happens to become test case 1
This happens to become test case 
This happens to be test case 1
This happens to be test case 
This happens to have test case 1
This happens to have test case 
This happens to having test case 1
This happens to having test case 
This happened to become test case 1
This happened to become test case 
This happened to be test case 1
This happened to be test case 
This happened to have test case 1
This happened to have test case 
This happened to having test case 1
This happened to having test case 

Upvotes: 2

Related Questions