Billy Moon
Billy Moon

Reputation: 58531

Generate all matches from a subset of regex

I need to define a bunch of vector sequences, which are all a series of L,D,R,U for left, down, right, up or x for break. There are optional parts, and either/or parts. I have been using my own invented system for noting it down, but I want to document this for other, potentially non-programmers to read.

I now want to use a subset (I don't plan on using any wildcards, or infinite repetition for example) of regex to define the vector sequence and a script to produce all possible matching strings...

/LDR/ produces ['LDR']
/LDU?R/ produces ['LDR','LDUR']
/R(LD|DR)U/ produces ['RLDU','RDRU']
/DxR[DL]U?RDRU?/ produces ['DxRDRDR','DxRDRDRU','DxRDURDR','DxRDURDRU','DxRLRDR','DxRLRDRU','DxRLURDR','DxRLURDRU']

Is there an existing library I can use to generate all matches?

EDIT

I realised I will only be needing or statements, as optional things can be specified by thing or nothing maybe a, or b, both optional could be (a|b|). Is there another language I could use to define what I am trying to do?

Upvotes: 1

Views: 589

Answers (3)

גלעד ברקן
גלעד ברקן

Reputation: 23955

Here is a JavaScript example that addresses parsing the (a|b) and (a|b|) possibilities, creates an array of possible substrings, and composes the matches based on this answer.

var regex = /\([RLUD]*\|[RLUD]*\|?\)/, 
    str = "R(LD|DR)U(R|L|)",
    substrings = [], matches = [], str_tmp = str, find

while (find = regex.exec(str_tmp)){
  var index = find.index

  finds = find[0].split(/\|/)
  substrings.push(str_tmp.substr(0, index))

  if (find[0].match(/\|/g).length == 1) 
    substrings.push([finds[0].substr(1), finds[1].replace(/.$/, '')])
  else if (find[0].match(/\|/g).length == 2){
    substrings.push([finds[0].substr(1), ""])
    substrings.push([finds[1], ""])
  }

  str_tmp = str_tmp.substr(index + find[0].length)
}
if (str_tmp) substrings.push([str_tmp])
console.log(substrings) //>>["R", ["LD", "DR"], "U", ["R", ""], ["L", ""]]

//compose matches
function printBin(tree, soFar, iterations) {
  if (iterations == tree.length) matches.push(soFar)
  else if (tree[iterations].length == 2){
      printBin(tree, soFar + tree[iterations][0], iterations + 1)
      printBin(tree, soFar + tree[iterations][1], iterations + 1)
  }
  else printBin(tree, soFar + tree[iterations], iterations + 1)
}
printBin(substrings, "", 0)
console.log(matches) //>>["RLDURL", "RLDUR", "RLDUL", "RLDU", "RDRURL", "RDRUR", "RDRUL", "RDRU"]

Upvotes: 1

Bernhard Barker
Bernhard Barker

Reputation: 55609

I'd imagine what you're trying is quite easy with a tree (as long as it's only or-statements).

Parse a(b|c)d (or any or-statement) into a tree as follows: a has children b and c, b and c have a mutual child d. b and c can both consist of 0 or more nodes (as in c could be g(e|f)h in which case (part of) the tree would be a -> g -> e/f (2 nodes) -> h -> d or c could be empty, in which case (part of) the tree would be a -> d, but an actual physical empty node may simplify things, which you should see when trying to write the code).

Generation of the tree shouldn't be too difficult with either recursion or a stack.

Once you have a tree, it's trivial to recursively iterate through the whole thing and generate all strings.

Also, here is a link to a similar question, providing a library or two.

EDIT:

"shouldn't be too difficult" - okay, maybe not

Here is a somewhat complicated example (Java) that may require some advanced knowledge about stacks.

Here is a slightly simpler version (Java) thanks to inserting a special character between each ((, )), |(, etc.

Note that neither of these are particularly efficient, the point is just to get the idea across.

Upvotes: 1

Billy Moon
Billy Moon

Reputation: 58531

By translating the java code form the link provided by @Dukeling into javascript, I think I have solved my problem...

var Node = function(str){
    this.bracket = false;
    this.children = [];
    this.s = str;
    this.next = null;
    this.addChild = function(child){
        this.children.push(child);
    }
}

var printTree = function(root,prefix){
  prefix = prefix.replace(/\./g, "");
  for(i in root.children){
    var child = root.children[i]
    printTree(child, prefix + root.s);
  }
  if(root.children.length < 1){
    console.log(prefix + root.s);
  }
}

var Stack = function(){
    this.arr = []
    this.push = function(item){
        this.arr.push(item)
    }
    this.pop = function(){
        return this.arr.pop()
    }
    this.peek = function(){
        return this.arr[this.arr.length-1]
    }
}

var createTree = function(s){

    // this line was causing errors for `a(((b|c)d)e)f` because the `(((` was only
    // replacing the forst two brackets.
    // var s = s.replace(/(\(|\||\))(\(|\||\))/g, "$1.$2");
    // this line fixes it
    var s = s.replace(/[(|)]+/g, function(x){ return x.split('').join('.') });

    var str = s.split('');
    var stack = new Stack();
    var root = new Node("");
    stack.push(root); // start node
    var justFinishedBrackets = false;
    for(i in str){
        var c = str[i]
        if(c == '('){
            stack.peek().next = new Node("Y"); // node after brackets
            stack.peek().bracket = true; // node before brackets
        } else if (c == '|' || c == ')'){
            var last = stack.peek(); // for (ab|cd)e, remember b / d so we can add child e to it
            while (!stack.peek().bracket){ // while not node before brackets
                stack.pop();
            }
            last.addChild(stack.peek().next); // for (b|c)d, add d as child to b / c
        } else {
            if (justFinishedBrackets){
                var next = stack.pop().next;
                next.s = "" + c;
                stack.push(next);
            } else {
                var n = new Node(""+c);
                stack.peek().addChild(n);
                stack.push(n);
            }
        }
        justFinishedBrackets = (c == ')');
    }
    return root;
}

// Test it out
var str = "a(c|mo(r|l))e";
var root = createTree(str);
printTree(root, "");
// Prints: ace / amore / amole

I only changed one line, to allow more than two consecutive brackets to be handled, and left the original translation in the comments

I also added a function to return an array of results, instead of printing them...

var getTree = function(root,prefix){
  this.out = this.out || []
  prefix = prefix.replace(/\./g, "");
  for(i in root.children){
    var child = root.children[i]
    getTree(child, prefix + root.s, out);
  }
  if(root.children.length < 1){
    this.out.push(prefix + root.s);
  }
  if(!prefix && !root.s){
    var out = this.out;
    this.out = null
    return out;
  }
}

// Test it
var str = "a(b|c)d";
var root = createTree(str);
console.log(getTree(root, ""));
// logs ["abd","acd"]

The last part, to allow for empty strings too, so... (ab|c|) means ab or c or nothing, and a convenience shortcut so that ab?c is translated into a(b|)c.

var getMatches = function(str){
    str = str.replace(/(.)\?/g,"($1|)")
    // replace all instances of `(???|)` with `(???|µ)`
    // the µ will be stripped out later
    str = str.replace(/\|\)/g,"|µ)")
    // fix issues where last character is `)` by inserting token `µ`
    // which will be stripped out later
    str = str+"µ"
    var root = createTree(str);
    var res = getTree(root, "");
    // strip out token µ
    for(i in res){
        res[i] = res[i].replace(/µ/g,"")
    }
    // return the array of results
    return res
}

getMatches("a(bc|de?)?f");
// Returns: ["abcf","adef","adf","af"]

The last part is a little hack-ish as it relies on µ not being in the string (not an issue for me) and solves one bug, where a ) at the end on the input string was causing incorrect output, by inserting a µ at the end of each string, and then stripping it from the results. I would be happy for someone to suggest a better way to handle these issues, so it can work as a more general solution.

This code as it stands does everything I need. Thanks for all your help!

Upvotes: 2

Related Questions