Reputation: 45
I have a file that contains, for example, this text:
perl java python php scala
java pascal perl ruby ada
ASP awk php java perl
C# ada python java scala
I found a module which calculates cosine similarity: http://search.cpan.org/~wollmers/Bag-Similarity-0.019/lib/Bag/Similarity/Cosine.pm
I did a simple test in the beginning:
my $cosine = Bag::Similarity::Cosine->new;
my $similarity = $cosine->similarity(['perl','java','python','php','scala'],['java','pascal','perl','ruby','ada']);
print $similarity;
The result was 0.4.
The problem is that when I read from the file and calculate the cosine between each pair of lines, the results are different. This is the code:
open(F,"/home/ahmed/FILE.txt") or die "Cannot open file";
my @data; # holds the lines of the file, one line per element
while(<F>) {
    chomp;
    push @data, $_;
}
#print join " ", @data;
my $cosine = Bag::Similarity::Cosine->new;
for my $i ( 0 .. $#data-1 ) {
    for my $j ( $i + 1 .. $#data ) {
        my $similarity = $cosine->similarity($data[$i],$data[$j]);
        print "line $i has a similarity of $similarity with line $j\n";
        $i + 1,
        $j + 1;
    }
}
The results:
line 0 has a similarity of 0.933424735647156 with line 1
line 0 has a similarity of 0.953945734121021 with line 2
line 0 has a similarity of 0.939759036144578 with line 3
line 1 has a similarity of 0.917585834612093 with line 2
line 1 has a similarity of 0.945092544842746 with line 3
line 2 has a similarity of 0.908826679128811 with line 3
The similarity between the first and second lines of the file (line 0 and line 1 in the output) should be 0.4.
I changed the file to look like this:
['perl','java','python','php','scala']
['java','pascal','perl','ruby','ada']
['ASP','awk','php','java','perl']
['C#','ada','python','java','scala']
but I got the same result. Thank you.
Upvotes: 0
Views: 442
Reputation: 69314
I know nothing at all about this module. But I can read the documentation.
It looks to me like the module has two methods: similarity() is used for comparing two strings, and from_bags() is used to compare two references to arrays containing strings. I expect that when you call similarity(), passing it two array references, what actually gets compared is the stringification of the two references.
Try switching to from_bags() and see if that's any better.
Update: On investigating further, I see that similarity() will compare any kind of input (strings, array refs or hash refs).
This demonstrates using similarity() to compare the lines as text and as arrays of words.
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;
use Bag::Similarity::Cosine;
chomp(my @data = <DATA>);
my $cos = Bag::Similarity::Cosine->new;
for my $i (0 .. $#data - 1) {
    for my $j (1 .. $#data) {
        next if $i == $j;
        say "$i -> $j: strings ", $cos->similarity($data[$i], $data[$j]);
        say "$i -> $j: array refs ", $cos->similarity([split /\s+/, $data[$i]], [split /\s+/, $data[$j]]);
    }
}
__DATA__
perl java python php scala
java pascal perl ruby ada
ASP awk php java perl
C# ada python java scala
And it gives this output:
$ perl similar
0 -> 1: strings 0.88602000346543
0 -> 1: array refs 0.4
0 -> 2: strings 0.89566858950296
0 -> 2: array refs 0.6
0 -> 3: strings 0.852802865422442
0 -> 3: array refs 0.6
1 -> 2: strings 0.872356744289958
1 -> 2: array refs 0.4
1 -> 3: strings 0.884721984738799
1 -> 3: array refs 0.4
2 -> 1: strings 0.872356744289958
2 -> 1: array refs 0.4
2 -> 3: strings 0.753778361444409
2 -> 3: array refs 0.2
I don't know which version gives you the information you want. I suspect it might be the array reference version.
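To see where the 0.4 in the array-reference version comes from, here is a minimal sketch that computes the cosine of two bags of words by hand, without the module. The helper cosine_of_bags is a hypothetical name introduced for illustration; it is not part of Bag::Similarity, and the module's own implementation may differ in its details.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of Bag::Similarity): cosine similarity
# between two bags of words, each passed as an array reference.
sub cosine_of_bags {
    my ($bag_a, $bag_b) = @_;

    # Count how often each word occurs in each bag.
    my (%count_a, %count_b);
    $count_a{$_}++ for @$bag_a;
    $count_b{$_}++ for @$bag_b;

    # Dot product over the words that both bags share.
    my $dot = 0;
    for my $word (keys %count_a) {
        $dot += $count_a{$word} * $count_b{$word} if exists $count_b{$word};
    }

    # Euclidean norms of the two count vectors.
    my ($sum_a, $sum_b) = (0, 0);
    $sum_a += $_ ** 2 for values %count_a;
    $sum_b += $_ ** 2 for values %count_b;

    # An empty bag has no direction, so treat its similarity as 0.
    return 0 unless $sum_a && $sum_b;
    return $dot / (sqrt($sum_a) * sqrt($sum_b));
}

my @first  = split ' ', 'perl java python php scala';
my @second = split ' ', 'java pascal perl ruby ada';

# Two shared words (perl, java) and five distinct words per line:
# 2 / (sqrt(5) * sqrt(5))
printf "%.1f\n", cosine_of_bags(\@first, \@second);   # prints 0.4
```

Comparing the lines as plain strings works on something other than whole words, which is why those scores come out much higher than the word-level ones.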
Upvotes: 0
Reputation: 21676
Your program has stray expressions ($i + 1, $j + 1) after the print statement. Were you trying to use printf and used print by mistake? Either way, the code below works fine for me.
#!/usr/bin/perl
use strict;
use warnings;
use Bag::Similarity::Cosine;
my $cosine = Bag::Similarity::Cosine->new;
my @data;
while ( <DATA> ) {
    push @data, { map { $_ => 1 } split };
}
for my $i ( 0 .. $#data-1 ) {
    for my $j ( $i + 1 .. $#data ) {
        my $similarity = $cosine->similarity($data[$i],$data[$j]);
        print "line $i has a similarity of $similarity with line $j\n";
    }
}
__DATA__
perl java python php scala
java pascal perl ruby ada
ASP awk php java perl
C# ada python java scala
Output:
line 0 has a similarity of 0.4 with line 1
line 0 has a similarity of 0.6 with line 2
line 0 has a similarity of 0.6 with line 3
line 1 has a similarity of 0.4 with line 2
line 1 has a similarity of 0.4 with line 3
line 2 has a similarity of 0.2 with line 3
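Since the question reads its lines from a file rather than from DATA, the same idea might be sketched as follows, splitting each line into an array reference of words before calling similarity(). This assumes Bag::Similarity::Cosine is installed; read_bags is a hypothetical helper, and the in-memory filehandle is only a stand-in for the real file so that the sketch is self-contained.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Bag::Similarity::Cosine;

# Stand-in for the file contents (assumption: the real file has one
# space-separated list of words per line, as in the question).
my $text = <<'END';
perl java python php scala
java pascal perl ruby ada
ASP awk php java perl
C# ada python java scala
END

# Hypothetical helper: read every non-blank line from a filehandle and
# turn it into an array reference of words.
sub read_bags {
    my ($fh) = @_;
    my @bags;
    while (my $line = <$fh>) {
        chomp $line;
        next unless $line =~ /\S/;
        push @bags, [ split ' ', $line ];
    }
    return @bags;
}

# To read a real file, open its path here instead of \$text.
open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
my @bags = read_bags($fh);
close $fh;

my $cosine = Bag::Similarity::Cosine->new;
for my $i (0 .. $#bags - 1) {
    for my $j ($i + 1 .. $#bags) {
        # Array refs are compared word by word, not character by character.
        my $similarity = $cosine->similarity($bags[$i], $bags[$j]);
        print "line $i has a similarity of $similarity with line $j\n";
    }
}
```

The important change from the question's code is that each line is split into words before the comparison; passing the raw line strings makes the module compare them as text.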
Upvotes: 1