Reputation: 143
I have several hundreds of thousand of devnagari words in text file (one word per line). I have to copy most similar word like "अक्तूबर, अक्टूबर" "कौम, क़ौम" in another file for correction. Maximum two place difference is allowed for copy. For that I used "awk" for find the difference of words and copy similar words to another file. But its failed, because this command works with only roman character not with devnagari character.
awk -v string=कौम -v string1=क़ौम '{ for (i=1;i<=length(string);i++) { if (substr(string,i,1) != substr(string1,i,1)) { count++ } }} END { print (count/length(string)*100"% difference") }' <<< ""
66.6667% difference
The above percentages is wrong, because above two words having very rear difference and expected difference should be between 5-10%.
Can you suggest me what to do in this case?
python, perl, shell anything is accepted.
Upvotes: 2
Views: 292
Reputation: 385917
You appear to want to compare grapheme clusters.
A grapheme cluster represents a horizontally segmentable unit of text, consisting of some grapheme base (which may consist of a Korean syllable) together with any number of nonspacing marks applied to it.
That's just a "fancy" way of that each grapheme cluster is a "visual character".
Let's confirm. The following program allows us to look at your string, divided into grapheme clusters.
use open ':std', ':encoding(UTF-8)';
use charnames qw( :full );
for my $arg_idx (0..$#ARGV) {
my $arg = $ARGV[$arg_idx];
utf8::decode($arg);
for my $grapheme_cluster ($arg =~ /\X/g) {
printf("%s %v04X\n", $grapheme_cluster, $grapheme_cluster);
for my $code_point (unpack('W*', $grapheme_cluster)) {
printf(" %04X %s\n", $code_point, charnames::viacode($code_point));
}
}
print("\n") if $arg_idx != $#ARGV;
}
For the one of your set of strings, we get
$ grapheme_clusters क़ौम $ grapheme_clusters क़ौम
कौ 0915.094C क़ौ 0915.093C.094C
0915 DEVANAGARI LETTER KA 0915 DEVANAGARI LETTER KA
093C DEVANAGARI SIGN NUKTA
094C DEVANAGARI VOWEL SIGN AU 094C DEVANAGARI VOWEL SIGN AU
म 092E म 092E
092E DEVANAGARI LETTER MA 092E DEVANAGARI LETTER MA
So far so good; this produces a single difference as expected.
For the other set of strings, we get
$ grapheme_clusters अक्तूबर $ grapheme_clusters अक्टूबर
अ 0905 अ 0905
0905 DEVANAGARI LETTER A 0905 DEVANAGARI LETTER A
क् 0915.094D क् 0915.094D.200D
0915 DEVANAGARI LETTER KA 0915 DEVANAGARI LETTER KA
094D DEVANAGARI SIGN VIRAMA 094D DEVANAGARI SIGN VIRAMA
200D ZERO WIDTH JOINER
तू 0924.0942 टू 091F.0942
0924 DEVANAGARI LETTER TA 091F DEVANAGARI LETTER TTA
0942 DEVANAGARI VOWEL SIGN UU 0942 DEVANAGARI VOWEL SIGN UU
ब 092C ब 092C
092C DEVANAGARI LETTER BA 092C DEVANAGARI LETTER BA
र 0930 र 0930
0930 DEVANAGARI LETTER RA 0930 DEVANAGARI LETTER RA
Ah, there's an unexpected ZERO WIDTH JOINER
in there. If we were to remove it (e.g. using s/\N{ZERO WIDTH JOINER}//g
, or by removing all control character using s/\pC//g
), we'd get the expected single difference.
Now that we've established what's needed, we can code a solution.
use List::Util qw( max );
sub count_diffs {
my ($s1, $s2) = @_;
s/\N{ZERO WIDTH JOINER}//g for $s1, $s2;
my @s1 = $s1 =~ /\X/g;
my @s2 = $s2 =~ /\X/g;
no warnings qw( uninitialized );
return 0+grep { $s1[$_] ne $s2[$_] } 0..max(0+@s1, 0+@s2)-1;
}
A major problem with this approach is that it doesn't handle insertions or deletions very well. For example, it considers abcdef
and bcdef
to have 6 differences. It would be much more effective to calculate the Levenshtein distance of the sequence of clusters instead of comparing by index.
use Algorithm::Diff qw( traverse_balanced );
sub count_diffs {
my ($s1, $s2) = @_;
s/\N{ZERO WIDTH JOINER}//g for $s1, $s2;
my @s1 = $s1 =~ /\X/g;
my @s2 = $s2 =~ /\X/g;
my $diffs = 0;
traverse_balanced(\@s1, \@s2,
{
DISCARD_A => sub { ++$diffs; },
DISCARD_B => sub { ++$diffs; },
CHANGE => sub { ++$diffs; },
},
);
return $diffs;
}
Finally, for performance reasons, you don't want to compare only two strings at a time; you want to compare each string against every other string at once. I don't know of a readily usable solution for this.
Upvotes: 3
Reputation: 39158
use utf8;
use List::Util qw(sum max);
use List::SomeUtils qw(pairwise);
sub norm { $_[0] =~ s/\pC//gr =~ /\X/g }
for my $pair (
[qw(अक्तूबर अक्टूबर)],
[qw(कौम क़ौम)],
) {
my @e0 = norm $pair->[0];
my @e1 = norm $pair->[1];
my $equal = sum pairwise { $a eq $b } @e0, @e1;
my $max = max scalar(@e0), scalar(@e1);
my $similarity = $equal / $max;
printf "%.1f%% similarity, %.1f%% difference\n",
100 * $similarity,
100 * (1 - $similarity);
}
Upvotes: 1