Ben_Coding
Ben_Coding

Reputation: 408

Algorithm to Learn a Regular Expression Pattern

Okay... so this may be a stretch. I'd like to feed a List of TextStrings to a function and then it would return the regular expression syntax back to me. I'm working with a pattern of Tags or Tags for Items to be tracked. I'd like to to be able to detect all the possible patterns that exist. I thought a regular expression that could sum it up would be great.

Is this something that has been done before.

I'm working in VB.NET C# suggestions are fine to.

Maybe this is a poor programming design. But mostly wanted to now where to start searching? What would I even lookup under google?

Or can you offer me some direction on how to design a function like this?

Upvotes: 2

Views: 345

Answers (3)

staafl
staafl

Reputation: 3235

My thoughts on the matter:

Problem statement

Given a finite set S of finite strings S1..Sn, find a regular expression that matches these strings in their entirety, and no other strings, avoiding the trivial solution ^S1|S2|S3|..|Sn$

Observations

We cannot have '.' character classes or '+', '*' or '{x,}' quantifiers, since those will allow matching of strings outside the input set.

Algorithm

# strings and collections indexed from 1

regex := "^" + make_regex(S) + "$"

function make_regex(Ss)

    # string -> length
    Ls := { length Ss[i] | 1 <= i <= n } 

    # set of starting characters
    Cs := unique { Ss[i][1] | i in 1..n L[i] >= 1 } 
    cl := last in Cs

    ret := "("
    For c in Cs

        # substrings of all non-empty strings that start with c
        Scs := { S[i][2..] | i in 1..n, L[i] >= 1, S[i][j] = c }
        ret := ret + c + make_regex(Scs) 

        If c != cl
            ret := ret + "|"
        End

    End

    If Ss contains ""
        ret := ret + "|()"

    End
    ret := ret + ")"
    return ret
End

Input

abc
abcd
cde
cef

Result

^(a(b(c(d|())))|c(d(e)|e(f)))$

I believe this is O(n*Lmax) as finding the regex for a unique string is linear wrt to its length. However, the more common prefixes in the input set, the closer it will be to O(Lmax) and so will be matching against the result - much faster than ^S1|S2|..|Sn$, which is O(n*Lmax).

Upvotes: 0

Godeke
Godeke

Reputation: 16281

While the .* is a funny comment, the question becomes more interesting if you add the constraint that it be a minimal regex (for some cost metric per operator) that accepts all of the data but no other data.

Which leads to the next problem: your data probably doesn't reflect all of the examples, so what you really are looking for is dividing the data by classes (character, numeric, symbol). Even that is going to be difficult though: what if you really care about character casing in one instance, but not in another. What if some symbols are delimiters and others separators and still others part of the data.

In other words, given enough restrictions you can do a breadth first search through the possible regular expressions (in your limited grammar) and when you find one that works you can stop. If that is realistic in your particular use-case would depend highly on those restrictions... it isn't realistic in the unrestricted case of general regular expressions.

Upvotes: 0

Abe Miessler
Abe Miessler

Reputation: 85096

Very interesting question. Not sure if there is a good answer or not but this was the first thing that came to mind:

Loop through each target string
    Loop through each character in each target string
       Categorize that character as precisely as possible.  7 = \d, f=[a-z] etc
       Create a list of the categories for each character in order.
       Add that list of categories to a list of lists
    End character loop
End target string loop

Attempt to use your List of category lists to determine a regex that will match all target strings. For example if your List of Category Lists looks like this:

\d,\d,\d,[a-z],[a-z]
\d,\d,\d,[a-z],[a-z]
\d,\d,[a-z],[a-z]

You may be able to determine your regex needs match two to three digits followed by two lower case letters. Not a whole lot to go off of, but maybe a place to start? I'd be interested to see if you come up with a working solution...

Upvotes: 1

Related Questions