Reputation: 408
Okay... so this may be a stretch. I'd like to feed a List of TextStrings to a function and then it would return the regular expression syntax back to me. I'm working with a pattern of Tags or Tags for Items to be tracked. I'd like to to be able to detect all the possible patterns that exist. I thought a regular expression that could sum it up would be great.
Is this something that has been done before.
I'm working in VB.NET C# suggestions are fine to.
Maybe this is a poor programming design. But mostly wanted to now where to start searching? What would I even lookup under google?
Or can you offer me some direction on how to design a function like this?
Upvotes: 2
Views: 345
Reputation: 3235
My thoughts on the matter:
Problem statement
Given a finite set S of finite strings S1..Sn, find a regular expression that matches these strings in their entirety, and no other strings, avoiding the trivial solution ^S1|S2|S3|..|Sn$
Observations
We cannot have '.' character classes or '+', '*' or '{x,}' quantifiers, since those will allow matching of strings outside the input set.
Algorithm
# strings and collections indexed from 1
regex := "^" + make_regex(S) + "$"
function make_regex(Ss)
# string -> length
Ls := { length Ss[i] | 1 <= i <= n }
# set of starting characters
Cs := unique { Ss[i][1] | i in 1..n L[i] >= 1 }
cl := last in Cs
ret := "("
For c in Cs
# substrings of all non-empty strings that start with c
Scs := { S[i][2..] | i in 1..n, L[i] >= 1, S[i][j] = c }
ret := ret + c + make_regex(Scs)
If c != cl
ret := ret + "|"
End
End
If Ss contains ""
ret := ret + "|()"
End
ret := ret + ")"
return ret
End
Input
abc
abcd
cde
cef
Result
^(a(b(c(d|())))|c(d(e)|e(f)))$
I believe this is O(n*Lmax) as finding the regex for a unique string is linear wrt to its length. However, the more common prefixes in the input set, the closer it will be to O(Lmax) and so will be matching against the result - much faster than ^S1|S2|..|Sn$, which is O(n*Lmax).
Upvotes: 0
Reputation: 16281
While the .*
is a funny comment, the question becomes more interesting if you add the constraint that it be a minimal regex (for some cost metric per operator) that accepts all of the data but no other data.
Which leads to the next problem: your data probably doesn't reflect all of the examples, so what you really are looking for is dividing the data by classes (character, numeric, symbol). Even that is going to be difficult though: what if you really care about character casing in one instance, but not in another. What if some symbols are delimiters and others separators and still others part of the data.
In other words, given enough restrictions you can do a breadth first search through the possible regular expressions (in your limited grammar) and when you find one that works you can stop. If that is realistic in your particular use-case would depend highly on those restrictions... it isn't realistic in the unrestricted case of general regular expressions.
Upvotes: 0
Reputation: 85096
Very interesting question. Not sure if there is a good answer or not but this was the first thing that came to mind:
Loop through each target string
Loop through each character in each target string
Categorize that character as precisely as possible. 7 = \d, f=[a-z] etc
Create a list of the categories for each character in order.
Add that list of categories to a list of lists
End character loop
End target string loop
Attempt to use your List of category lists to determine a regex that will match all target strings. For example if your List of Category Lists looks like this:
\d,\d,\d,[a-z],[a-z]
\d,\d,\d,[a-z],[a-z]
\d,\d,[a-z],[a-z]
You may be able to determine your regex needs match two to three digits followed by two lower case letters. Not a whole lot to go off of, but maybe a place to start? I'd be interested to see if you come up with a working solution...
Upvotes: 1