Amol
Amol

Reputation: 143

Devnagari String comparison

I have several hundreds of thousand of devnagari words in text file (one word per line). I have to copy most similar word like "अक्तूबर, अक्‍टूबर" "कौम, क़ौम" in another file for correction. Maximum two place difference is allowed for copy. For that I used "awk" for find the difference of words and copy similar words to another file. But its failed, because this command works with only roman character not with devnagari character.

awk -v string=कौम -v string1=क़ौम '{ for (i=1;i<=length(string);i++) { if (substr(string,i,1) != substr(string1,i,1)) { count++ } }} END { print (count/length(string)*100"% difference") }' <<< ""

66.6667% difference

The above percentages is wrong, because above two words having very rear difference and expected difference should be between 5-10%.

Can you suggest me what to do in this case?

python, perl, shell anything is accepted.

Upvotes: 2

Views: 292

Answers (2)

ikegami
ikegami

Reputation: 385917

You appear to want to compare grapheme clusters.

A grapheme cluster represents a horizontally segmentable unit of text, consisting of some grapheme base (which may consist of a Korean syllable) together with any number of nonspacing marks applied to it.

That's just a "fancy" way of that each grapheme cluster is a "visual character".

Let's confirm. The following program allows us to look at your string, divided into grapheme clusters.

use open ':std', ':encoding(UTF-8)';

use charnames qw( :full );

for my $arg_idx (0..$#ARGV) {
   my $arg = $ARGV[$arg_idx];

   utf8::decode($arg);

   for my $grapheme_cluster ($arg =~ /\X/g) {
      printf("%s %v04X\n", $grapheme_cluster, $grapheme_cluster);
      for my $code_point (unpack('W*', $grapheme_cluster)) {
         printf("   %04X %s\n", $code_point, charnames::viacode($code_point));
      }
   }

   print("\n") if $arg_idx != $#ARGV;
}

For the one of your set of strings, we get

$ grapheme_clusters क़ौम              $ grapheme_clusters क़ौम           
कौ 0915.094C                         क़ौ 0915.093C.094C                 
   0915 DEVANAGARI LETTER KA            0915 DEVANAGARI LETTER KA       
                                        093C DEVANAGARI SIGN NUKTA      
   094C DEVANAGARI VOWEL SIGN AU        094C DEVANAGARI VOWEL SIGN AU   
म 092E                               म 092E                             
   092E DEVANAGARI LETTER MA            092E DEVANAGARI LETTER MA       

So far so good; this produces a single difference as expected.

For the other set of strings, we get

$ grapheme_clusters अक्तूबर            $ grapheme_clusters अक्‍टूबर
अ 0905                               अ 0905         
   0905 DEVANAGARI LETTER A             0905 DEVANAGARI LETTER A
क् 0915.094D                          क्‍ 0915.094D.200D
   0915 DEVANAGARI LETTER KA            0915 DEVANAGARI LETTER KA
   094D DEVANAGARI SIGN VIRAMA          094D DEVANAGARI SIGN VIRAMA
                                        200D ZERO WIDTH JOINER
तू 0924.0942                          टू 091F.0942
   0924 DEVANAGARI LETTER TA            091F DEVANAGARI LETTER TTA
   0942 DEVANAGARI VOWEL SIGN UU        0942 DEVANAGARI VOWEL SIGN UU
ब 092C                               ब 092C           
   092C DEVANAGARI LETTER BA            092C DEVANAGARI LETTER BA
र 0930                               र 0930
   0930 DEVANAGARI LETTER RA            0930 DEVANAGARI LETTER RA       

Ah, there's an unexpected ZERO WIDTH JOINER in there. If we were to remove it (e.g. using s/\N{ZERO WIDTH JOINER}//g, or by removing all control character using s/\pC//g), we'd get the expected single difference.


Now that we've established what's needed, we can code a solution.

use List::Util qw( max );

sub count_diffs {
   my ($s1, $s2) = @_;

   s/\N{ZERO WIDTH JOINER}//g for $s1, $s2;

   my @s1 = $s1 =~ /\X/g;
   my @s2 = $s2 =~ /\X/g;

   no warnings qw( uninitialized );
   return 0+grep { $s1[$_] ne $s2[$_] } 0..max(0+@s1, 0+@s2)-1;
}

A major problem with this approach is that it doesn't handle insertions or deletions very well. For example, it considers abcdef and bcdef to have 6 differences. It would be much more effective to calculate the Levenshtein distance of the sequence of clusters instead of comparing by index.

use Algorithm::Diff qw( traverse_balanced );

sub count_diffs {
   my ($s1, $s2) = @_;

   s/\N{ZERO WIDTH JOINER}//g for $s1, $s2;

   my @s1 = $s1 =~ /\X/g;
   my @s2 = $s2 =~ /\X/g;

   my $diffs = 0;
   traverse_balanced(\@s1, \@s2,
      {
         DISCARD_A => sub { ++$diffs; },
         DISCARD_B => sub { ++$diffs; },
         CHANGE    => sub { ++$diffs; },
      },
   );

   return $diffs;
}

Finally, for performance reasons, you don't want to compare only two strings at a time; you want to compare each string against every other string at once. I don't know of a readily usable solution for this.

Upvotes: 3

daxim
daxim

Reputation: 39158

use utf8;
use List::Util qw(sum max);
use List::SomeUtils qw(pairwise);

sub norm { $_[0] =~ s/\pC//gr =~ /\X/g }
for my $pair (
    [qw(अक्तूबर अक्‍टूबर)],
    [qw(कौम क़ौम)],
) {
    my @e0 = norm $pair->[0];
    my @e1 = norm $pair->[1];
    my $equal = sum pairwise { $a eq $b } @e0, @e1;
    my $max = max scalar(@e0), scalar(@e1);
    my $similarity = $equal / $max;
    printf "%.1f%% similarity, %.1f%% difference\n",
        100 * $similarity,
        100 * (1 - $similarity);
}

Upvotes: 1

Related Questions