defoo
defoo

Reputation: 5317

Generate regular expression from diff

I am looking for a way to generate regex from based on the differences between 2 input strings. The regex can then be used to match the 3rd string that is also similar to the 2 input strings.

For example, given 3 strings: f1, f2, and f3

f1:

111xxx222

f2:

111yyy222

f3:

111zzz222

I am wondering if there is a generic way to generate the following regex from f1 and f2 that can be used to extract "zzz" in f3:

111(.*)222

Upvotes: 2

Views: 449

Answers (5)

Steve Bennett
Steve Bennett

Reputation: 126457

I think this is a really interesting problem, particularly if it's generalised to something like "given N strings, find the simplest, most restrictive regex R that matches them". In the test data given, that would probably yield unhelpful answers like "111(xxx|yyy)222", so the question might need to be fine-tuned.

This does sound like the kind of pure compsci question that ought to have had a paper written about it...

Upvotes: 0

PaulMcG
PaulMcG

Reputation: 63749

The Python std lib includes a diff module, difflib. Here is a stab at your problem:

>>> import difflib
>>> f1 = "111xxx222"
>>> f2 = "111yyy222"
>>> sm = difflib.SequenceMatcher(a=f1, b=f2)
>>> diff_re = ''
>>> for a,b,n in sm.get_matching_blocks():
...   if not n: continue
...   if diff_re: diff_re += "(.*)"
...   diff_re += f1[a:a+n]
...
>>> print diff_re
'111(.*)222'

Upvotes: 0

Zac Thompson
Zac Thompson

Reputation: 12685

Now you have two problems.

You need to think hard about what the set of possible strings are, and what is the "right" regular expression for each input pair. I know it seems obvious to you right now but convincing your algorithm to do what you expect may be harder than you think. A little TDD is always a good idea: try to think of some pathological cases.

In your example, why shouldn't it generate something like one of these, instead of your suggestion (see @M42's answer for the first one)?:

111(...)222
111(...*)222
111(...+)222
111(..*)222
111(...?)222
(.*)111(.*)222(.*)
etc.

What would you like to happen if the third string is one of these? Or the second?

1111zzz222
111zzz2222
0111zzz222
111zzz2223
111zzzz222
111zz222
11zzz222
111zzz22
111zzz
zzz222
111xyz222
1z1z1z222
111222
1111222
1112222
etc.

If (as I suspect) you already know that some of these cases are going to be impossible, then you already know more about the possible structure of the strings than you are letting on. And if that's the case, you should probably just parse the strings and have done.

If you really really really want to try this anyway then the first thing you need to figure out is whether you want to match "longest common subsequence" or some variation of "longest common substring".

If you really are running this against arbitrary strings, then once you think you have a good algorithm, make sure to test it against a lot of sample data (which should be pretty easy to generate ...)

Upvotes: 2

Vicky
Vicky

Reputation: 13244

There's probably a much nicer way to do this but, in Python:

import re

f1 = "111xxx222"
f2 = "111yyy222"
f3 = "111zzz222"

pre = []
post = []
found_diff = 0

for i in range(len(f1)):
    if f1[i] == f2[i]:
        if found_diff == 0:
            pre.append(f1[i])
        else:
            post.append(f1[i])
    else:
        found_diff = 1

regex = "".join(pre) + "(.*)" + "".join(post)

re.compile(regex)
print re.match(regex, f3).group(1)

prints zzz.

NB this won't work if the first two strings are not the same length.

Upvotes: 0

Toto
Toto

Reputation: 91488

Assuming the 2 strings are equal length, with Perl you can do:

my @x = split//,'111xxx222';
my @y = split//,'111yyy222';
my $r;
for my $i(0..$#x) {
    $r .= $x[$i] eq $y[$i] ? $x[$i] : '.';
}
$r =~ s/(\.+)/($1)/g;
print $r

output:

111(...)222

Upvotes: 1

Related Questions