Daniel

Reputation: 45

Cosine similarity between strings in Perl

I have a file that contains, for example, this text:

 perl java python php scala 
 java pascal perl ruby ada   
 ASP awk php java perl 
 C# ada python java scala

I found a module that calculates cosine similarity: http://search.cpan.org/~wollmers/Bag-Similarity-0.019/lib/Bag/Similarity/Cosine.pm

I did a simple test in the beginning:

use Bag::Similarity::Cosine;

my $cosine = Bag::Similarity::Cosine->new;
my $similarity = $cosine->similarity(['perl','java','python','php','scala'],['java','pascal','perl','ruby','ada']);
print $similarity;

The result was 0.4.

The problem is that when I read the lines from the file and calculate the cosine similarity between each pair of lines, the results are different. This is the code:

open(F, "/home/ahmed/FILE.txt") or die "Cannot open file";
my @data; # holds one line of the file per element

while (<F>) {
    chomp;
    push @data, $_;
}
#print join " ", @data;

my $cosine = Bag::Similarity::Cosine->new;

for my $i ( 0 .. $#data-1 ) {
    for my $j ( $i + 1 .. $#data ) {
        my $similarity = $cosine->similarity($data[$i], $data[$j]);
        print "line $i has a similarity of $similarity with line $j\n";
        $i + 1,
        $j + 1;
    }
}

The results:

line 0 has a similarity of 0.933424735647156 with line 1
line 0 has a similarity of 0.953945734121021 with line 2
line 0 has a similarity of 0.939759036144578 with line 3
line 1 has a similarity of 0.917585834612093 with line 2
line 1 has a similarity of 0.945092544842746 with line 3
line 2 has a similarity of 0.908826679128811 with line 3

The similarity between the first two lines of the file (line 0 and line 1 in the output above) should be 0.4.

I changed the file like this:

['perl','java','python','php','scala'] 
['java','pascal','perl','ruby','ada']  
['ASP','awk','php','java','perl']
['C#','ada','python','java','scala']

But I get the same result. Thank you.

Upvotes: 0

Views: 442

Answers (2)

Dave Cross

Reputation: 69314

I know nothing at all about this module. But I can read the documentation.

It looks to me like the module has two methods. similarity() is used for comparing two strings and from_bags() is used to compare two references to arrays containing strings. I expect that when you call similarity passing it two array references, then what gets compared is actually the stringification of the two references.

Try switching to from_bags() and see if that's any better.

Update: On investigating further, I see that similarity() will compare any kind of input (strings, array refs or hash refs).

This demonstrates using similarity() to compare the lines as text and as arrays of words.

#!/usr/bin/perl

use strict;
use warnings;
use 5.010;

use Bag::Similarity::Cosine;

chomp(my @data = <DATA>);

my $cos = Bag::Similarity::Cosine->new;

for my $i (0 .. $#data - 1) {
  for my $j (1 .. $#data) {
    next if $i == $j;
    say "$i -> $j: strings ", $cos->similarity($data[$i], $data[$j]);
    say "$i -> $j: array refs ", $cos->similarity([split /\s+/, $data[$i]], [split /\s+/, $data[$j]]);
  }
}

__DATA__
perl java python php scala
java pascal perl ruby ada
ASP awk php java perl
C# ada python java scala

And it gives this output:

$ perl similar
0 -> 1: strings 0.88602000346543
0 -> 1: array refs 0.4
0 -> 2: strings 0.89566858950296
0 -> 2: array refs 0.6
0 -> 3: strings 0.852802865422442
0 -> 3: array refs 0.6
1 -> 2: strings 0.872356744289958
1 -> 2: array refs 0.4
1 -> 3: strings 0.884721984738799
1 -> 3: array refs 0.4
2 -> 1: strings 0.872356744289958
2 -> 1: array refs 0.4
2 -> 3: strings 0.753778361444409
2 -> 3: array refs 0.2

I don't know which version gives you the information you want. I suspect it might be the array reference version.
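
To see where 0.4 comes from, and why passing whole lines as strings scores so much higher, here is a minimal sketch in plain Perl that does not use the module. The bag() and cosine() helpers are hypothetical names of my own, not part of Bag::Similarity. The character-level comparison rests on an assumption that is not confirmed in this thread: that a plain string ends up being split into tokens much smaller than words. Its exact figure therefore need not match the module's output.

```perl
use strict;
use warnings;

# Hypothetical helper: count how often each token occurs (a "bag").
sub bag {
    my %count;
    $count{$_}++ for @_;
    return \%count;
}

# Hypothetical helper: cosine similarity of two bags, i.e. the dot product
# of the count vectors divided by the product of their lengths.
sub cosine {
    my ($x, $y) = @_;
    my ($dot, $nx, $ny) = (0, 0, 0);
    $dot += $x->{$_} * $y->{$_} for grep { exists $y->{$_} } keys %$x;
    $nx  += $_ ** 2 for values %$x;
    $ny  += $_ ** 2 for values %$y;
    return 0 unless $nx && $ny;
    return $dot / sqrt($nx * $ny);
}

my $line0 = 'perl java python php scala';
my $line1 = 'java pascal perl ruby ada';

# Word tokens: each line is split on whitespace first.
my $by_word = cosine(bag(split ' ', $line0), bag(split ' ', $line1));

# Character tokens (assumption, see above): every letter and space counts.
my $by_char = cosine(bag(split //, $line0), bag(split //, $line1));

print "words: $by_word\n";
print "chars: $by_char\n";
```

At word level only perl and java are shared between the two lines, so the score is 2 / sqrt(5 * 5) = 0.4. At character level most letters and all the spaces occur in both lines, which pushes the score much closer to 1.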

Upvotes: 0

Chankey Pathak

Reputation: 21676

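There is an error in your program: the $i + 1, $j + 1; left after the print statement does nothing. Were you trying to use printf and mistakenly used print? More importantly, you pass whole lines to similarity() as strings. The code below splits each line into words first, and it works fine for me.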

#!/usr/bin/perl
use strict;
use warnings;
use Bag::Similarity::Cosine;

my $cosine = Bag::Similarity::Cosine->new;
my @data;

while ( <DATA> ) {
    push @data, { map { $_ => 1 } split };
}

for my $i ( 0 .. $#data-1 ) {
    for my $j ( $i + 1 .. $#data ) {
        my $similarity = $cosine->similarity($data[$i],$data[$j]);
        print "line $i has a similarity of $similarity with line $j\n";
    }
}

__DATA__
perl java python php scala
java pascal perl ruby ada
ASP awk php java perl
C# ada python java scala

Output:

line 0 has a similarity of 0.4 with line 1
line 0 has a similarity of 0.6 with line 2
line 0 has a similarity of 0.6 with line 3
line 1 has a similarity of 0.4 with line 2
line 1 has a similarity of 0.4 with line 3
line 2 has a similarity of 0.2 with line 3
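
As a cross-check on these numbers, here is a rough sketch that computes the same scores by hand from one hash per line, without the module. set_cosine() is a hypothetical helper of my own, and it assumes each word appears at most once per line, which holds for this data.

```perl
use strict;
use warnings;

# Hypothetical helper: for two sets stored as hash keys, the cosine is the
# number of shared keys divided by sqrt(size of x * size of y).
sub set_cosine {
    my ($x, $y) = @_;
    my $shared = grep { exists $y->{$_} } keys %$x;
    my $sizes  = scalar(keys %$x) * scalar(keys %$y);
    return $sizes ? $shared / sqrt($sizes) : 0;
}

my @lines = (
    'perl java python php scala',
    'java pascal perl ruby ada',
    'ASP awk php java perl',
    'C# ada python java scala',
);

# Build one hash per line with each word as a key.
my @sets;
for my $line (@lines) {
    my %words = map { $_ => 1 } split ' ', $line;
    push @sets, \%words;
}

# Compare every pair of lines once.
for my $i ( 0 .. $#sets - 1 ) {
    for my $j ( $i + 1 .. $#sets ) {
        my $similarity = set_cosine($sets[$i], $sets[$j]);
        print "line $i has a similarity of $similarity with line $j\n";
    }
}
```

Every line has five distinct words, so each score is just the number of shared words divided by 5: two shared words give 0.4, three give 0.6, and one gives 0.2.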

Upvotes: 1
