Reputation: 5317
I am looking for a way to generate regex from based on the differences between 2 input strings. The regex can then be used to match the 3rd string that is also similar to the 2 input strings.
For example, given 3 strings: f1, f2, and f3
f1:
111xxx222
f2:
111yyy222
f3:
111zzz222
I am wondering if there is a generic way to generate the following regex from f1 and f2 that can be used to extract "zzz" in f3:
111(.*)222
Upvotes: 2
Views: 449
Reputation: 126457
I think this is a really interesting problem, particularly if it's generalised to something like "given N strings, find the simplest, most restrictive regex R that matches them". In the test data given, that would probably yield unhelpful answers like "111(xxx|yyy)222", so the question might need to be fine-tuned.
This does sound like the kind of pure compsci question that ought to have had a paper written about it...
Upvotes: 0
Reputation: 63749
The Python std lib includes a diff module, difflib. Here is a stab at your problem:
>>> import difflib
>>> f1 = "111xxx222"
>>> f2 = "111yyy222"
>>> sm = difflib.SequenceMatcher(a=f1, b=f2)
>>> diff_re = ''
>>> for a,b,n in sm.get_matching_blocks():
... if not n: continue
... if diff_re: diff_re += "(.*)"
... diff_re += f1[a:a+n]
...
>>> print diff_re
'111(.*)222'
Upvotes: 0
Reputation: 12685
You need to think hard about what the set of possible strings are, and what is the "right" regular expression for each input pair. I know it seems obvious to you right now but convincing your algorithm to do what you expect may be harder than you think. A little TDD is always a good idea: try to think of some pathological cases.
In your example, why shouldn't it generate something like one of these, instead of your suggestion (see @M42's answer for the first one)?:
111(...)222
111(...*)222
111(...+)222
111(..*)222
111(...?)222
(.*)111(.*)222(.*)
etc.
What would you like to happen if the third string is one of these? Or the second?
1111zzz222
111zzz2222
0111zzz222
111zzz2223
111zzzz222
111zz222
11zzz222
111zzz22
111zzz
zzz222
111xyz222
1z1z1z222
111222
1111222
1112222
etc.
If (as I suspect) you already know that some of these cases are going to be impossible, then you already know more about the possible structure of the strings than you are letting on. And if that's the case, you should probably just parse the strings and have done.
If you really really really want to try this anyway then the first thing you need to figure out is whether you want to match "longest common subsequence" or some variation of "longest common substring".
If you really are running this against arbitrary strings, then once you think you have a good algorithm, make sure to test it against a lot of sample data (which should be pretty easy to generate ...)
Upvotes: 2
Reputation: 13244
There's probably a much nicer way to do this but, in Python:
import re
f1 = "111xxx222"
f2 = "111yyy222"
f3 = "111zzz222"
pre = []
post = []
found_diff = 0
for i in range(len(f1)):
if f1[i] == f2[i]:
if found_diff == 0:
pre.append(f1[i])
else:
post.append(f1[i])
else:
found_diff = 1
regex = "".join(pre) + "(.*)" + "".join(post)
re.compile(regex)
print re.match(regex, f3).group(1)
prints zzz.
NB this won't work if the first two strings are not the same length.
Upvotes: 0
Reputation: 91488
Assuming the 2 strings are equal length, with Perl you can do:
my @x = split//,'111xxx222';
my @y = split//,'111yyy222';
my $r;
for my $i(0..$#x) {
$r .= $x[$i] eq $y[$i] ? $x[$i] : '.';
}
$r =~ s/(\.+)/($1)/g;
print $r
output:
111(...)222
Upvotes: 1