Reputation: 189
How to find if a string is present with one or two mismatch in another string?
my $find = "MATCH";
my $search = "stringisMATTHhere";
# $search has one mismatch: MATTH
# for exact match, this one seems working
if ($search =~ /$find/){
print "String found";
}
else {
print "String not found";
}
How can I solve this issue with one mismatch: MSTCH, AATCH, MACCH, etc. and two mismatches: ATTCH, MGGCH, etc
Upvotes: 0
Views: 1431
Reputation: 385789
So you want to do
/
..TCH | .A.CH | .AT.H | .ATC. |
M..CH | M.T.H | M.TC. |
MA..H | MA.C. |
MAT..
/x
or
/
\w\wTCH | \wA\wCH | \wAT\wH | \wATC\w |
M\w\wCH | M\wT\wH | M\wTC\w |
MA\w\wH | MA\wC\w |
MAT\w\w
/x
Easy enough:
my @subpats;
for my $i (0..length($find)-1) {
for my $j ($i+1..length($find)-1) {
my $subpat = join('',
substr($find, 0, $i),
'.', # or '\\w'
substr($find, $i+1, $j-$i-1),
'.', # or '\\w'
substr($find, $j+1),
);
push @subpats, $subpat;
}
}
my $pat = join('|', @subpats);
$search =~ /$pat/
Perl 5.10+ trie-based alternations should optimize the common leading prefixes into something efficient. Saves us the trouble of generating (?:.…|M…)
.
Upvotes: 2
Reputation:
I got re-interested in this so I thought I'd try something with a little more
variability in a controlled way.
Features:
- Can set a min/max mismatch range for individual find's.
- Can set a flag to exclude/include space 0x20 or less in the mismatch count.
- Automatically escape meta-chars in find's.
That's it.
Good Luck!!
Regex:
(?s)
(?{ $cnt = 0; $lcnt = 0 })
(?:
(?>
(??{ $aryinput[$lcnt++] })
| (?&getexpr)
)
){$len}
(??{ $cnt >= $mincnt && $cnt <= $maxcnt ? '' : '(?!)' })
(?(DEFINE)
(?<getexpr>
(??{ ++$cnt <= $maxcnt ?
( $visible_only ?
( $aryinput[$lcnt-1] le ' ' ?
'(?!)'
: '[^\x{0}-\x{20}]'
)
: '.'
)
: '(?!)'
})
)
)
Perl code:
use strict;
use warnings;
my $target =
"
one mismatch: MSTCH, AATCH, MACCH, etc. and two mismatches: ATTCH, MGGCH,
MA1CH T23S
M.1CH T23S
MAT1 H2T3IS
0M[T2CH THaS
0M[T2CH THaS
MA1CH THIS
MATCH THIS
MATCHT1IS
MA1CH THIS
MAT1H THIb
MATCH THIS
MArCH THIS
AATCH THIS
[()+?.*{}|]
[()X?.*{}|]
[()+?.SS}|]
";
my @aryinput = ();
my ($rx, $find, $visible_only, $len, $cnt, $mincnt, $maxcnt, $lcnt) = ('', '',0,0,0,0,0,0);
my @TestRuns = (
{ find => 'MATCH THIS', visible => 1, min => 0, max => 3 },
{ find => 'MATCH', visible => 1, min => 0, max => 3 },
{ find => 'MATCH THIS', visible => 0, min => 0, max => 3 },
{ find => 'MATCH', visible => 0, min => 2, max => 3 },
{ find => 'MATCH', visible => 0, min => 1, max => 1 },
{ find => '[()+?.*{}|]', visible => 1, min => 1, max => 3 },
);
for ( @TestRuns )
{
GetParms( $_ );
SetFindArray( $find );
print "\nFind($len), ", ($visible_only ? "not counting control char" : "counting any char"), ", minmax($mincnt,$maxcnt):\n'$find'\n";
while( $target =~ /$rx/g )
{
print " cnt($cnt) : '$&'\n";
}
}
# ==================================
# ==================================
sub GetParms
{
my ($href) = @_;
( $find, $visible_only, $mincnt, $maxcnt ) =
( $$href{find}, $$href{visible}, $$href{min}, $$href{max} );
}
sub SetFindArray
{
my ($inp) = @_;
@aryinput = ();
@aryinput = map { s/([\\().?*+{}|\[\]])/\\$1/; $_ } split '', $inp;
$len = @aryinput;
$rx = qr/(?s)(?{ $cnt = 0; $lcnt = 0 })(?s)(?:(?>(??{ $aryinput[$lcnt++] })|(?&getexpr))){$len}(??{ $cnt >= $mincnt && $cnt <= $maxcnt ? '' : '(?!)' })(?(DEFINE)(?<getexpr>(??{ ++$cnt <= $maxcnt ?
( $visible_only ?
( $aryinput[$lcnt-1] le ' ' ?
'(?!)'
: '[^\x{0}-\x{20}]'
)
: '.'
)
: '(?!)'
})))/;
}
Output:
Find(10), not counting control char, minmax(0,3):
'MATCH THIS'
cnt(3) : 'MA1CH T23S'
cnt(1) : 'MA1CH THIS'
cnt(2) : 'MAT1H THIb'
cnt(0) : 'MATCH THIS'
cnt(1) : 'MArCH THIS'
cnt(1) : 'AATCH THIS'
Find(5), not counting control char, minmax(0,3):
'MATCH'
cnt(1) : 'MSTCH'
cnt(1) : 'AATCH'
cnt(1) : 'MACCH'
cnt(2) : 'ATTCH'
cnt(2) : 'MGGCH'
cnt(1) : 'MA1CH'
cnt(2) : 'M.1CH'
cnt(3) : 'M[T2C'
cnt(3) : 'M[T2C'
cnt(1) : 'MA1CH'
cnt(0) : 'MATCH'
cnt(0) : 'MATCH'
cnt(1) : 'MA1CH'
cnt(1) : 'MAT1H'
cnt(0) : 'MATCH'
cnt(1) : 'MArCH'
cnt(1) : 'AATCH'
Find(10), counting any char, minmax(0,3):
'MATCH THIS'
cnt(3) : 'MA1CH T23S'
cnt(2) : 'MA1CH THIS'
cnt(1) : 'MATCH THIS'
cnt(1) : 'MA1CH THIS'
cnt(2) : 'MAT1H THIb'
cnt(0) : 'MATCH THIS'
cnt(1) : 'MArCH THIS'
cnt(1) : 'AATCH THIS'
Find(5), counting any char, minmax(2,3):
'MATCH'
cnt(3) : ' ATTC'
cnt(2) : 'MGGCH'
cnt(2) : 'M.1CH'
cnt(2) : 'MAT1 '
cnt(3) : 'M[T2C'
cnt(3) : 'M[T2C'
Find(5), counting any char, minmax(1,1):
'MATCH'
cnt(1) : 'MSTCH'
cnt(1) : 'AATCH'
cnt(1) : 'MACCH'
cnt(1) : 'MA1CH'
cnt(1) : 'MA1CH'
cnt(1) : 'MA1CH'
cnt(1) : 'MAT1H'
cnt(1) : 'MArCH'
cnt(1) : 'AATCH'
Find(11), not counting control char, minmax(1,3):
'[()+?.*{}|]'
cnt(1) : '[()X?.*{}|]'
cnt(2) : '[()+?.SS}|]'
Upvotes: 0
Reputation: 451
As far as I know there is only one convenient solution using a special REGEX engine: https://metacpan.org/pod/re::engine::TRE.
Here the solution for your example:
#!/usr/bin/perl
use strict;
use warnings;
use re::engine::TRE max_cost => 2;
my $find = "MATCH";
my $search = "stringisMATTHhere";
if ($search =~ /\($find\)/) {
print $1,"\n";
}
This outputs:
$ perl fuzzy_re.pl
MATTH
Upvotes: 2
Reputation: 451
If the searched string should have the same length (i.e. only mismatches allowed) as stated in a later comment, you can use Hamming distance, which is very fast:
#!/usr/bin/perl
use strict;
use warnings;
my $find = "MATCH";
my $search = "stringisMATTHhere";
my $max_distance = 2;
for my $offset (0..length($search)-length($find)) {
my $hd = hd($find,substr($search,$offset,length($find)));
if ($hd <= $max_distance) {
print substr($search,$offset,length($find)),"\n";
}
}
# assumes byte mode
sub hd {
return ($_[0] ^ $_[1]) =~ tr/\001-\255//;
}
Upvotes: 2