Rick

Reputation: 107

Variable used to measure CosineSimilarity in Text::Document is uninitialized

I am using Text::Document to compute the cosine similarity between two documents. When I try to print the variable holding the resulting score ($sim), I get a message saying: "Use of uninitialized value $sim in concatenation (.) or string...". As far as I can tell, I initialize this variable immediately above the print statement. Admittedly, this is my first foray into the Text::Document module, and my object construction here is probably faulty, ugly, or otherwise problematic. Any ideas what's wrong with the variable initialization?

use strict ;
use warnings ;
use autodie ;
use Text::Document ;

### BEGIN BY READING IN EACH FILE ONE BY ONE. ###
################## LOOP BEGIN ##################
# Process every file with a `txt` file type

my $parent = "D:/Cleaned 10Ks" ;
my ($par_dir, $sub_dir);
opendir($par_dir, $parent);

while (my $sub_folders = readdir($par_dir)) {
    next if ($sub_folders =~ /^..?$/);  # skip . and ..
    my $path = $parent . '/' . $sub_folders;
    next unless (-d $path);   # skip anything that isn't a directory
    chdir($path) or die "Cant chdir to $path $!";

    for my $filename ( grep -f, glob('*') ) {

        open my ($fh), '<', $filename;
        my $data1 = do {local $/; <$fh> } ;
        my $data2 = Text::Document->new(file=>'$data1') ;
        my $data3 = $data2->WriteToString() ;
        my $data4 = Text::Document::NewFromString($data3) ;

        my ($comp_id, $year, $rest) = split '-', $filename, 3;
        my $prev_year = ($year ne '00') ? $year - 1 : 99;
        my $prev_year_base = join '-', $comp_id, $year ;
        my ($prev_year_file) = glob "$prev_year_base*" ;

        open my ($fh_prior), '<', $prev_year_file ;
        my $data1_prior = do {local $/; <$fh_prior> } ;
        my $data2_prior = Text::Document->new(file=>'$data1_prior') ;
        my $data3_prior = $data2->WriteToString() ;
        my $data4_prior = Text::Document::NewFromString($data3_prior) ;
        my $sim = $data4->CosineSimilarity( $data4_prior ) ;

        print "The cosine similarity score is $sim\n" ;
    }
}

Upvotes: 2

Views: 67

Answers (2)

foundry

Reputation: 31745

You have a couple of issues.

my $data2 = Text::Document->new(file=>'$data1') ;

Here you seem to imagine that $data2 will be initialised with the content of $data1.

In fact the file keyword does nothing here, and the line is equivalent to

my $data2 = Text::Document->new() ;

You have successfully initialised a Text::Document object but it has no data.

You do the same for the prior object, so you end up comparing two objects with no comparison terms, and $sim comes back undefined.
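A side note on the file=>'$data1' argument: even if the constructor did use that option, single quotes in Perl do not interpolate, so the value passed would be the literal text $data1 rather than the file contents. A minimal sketch of the difference (the sample string is made up for illustration):

```perl
use strict;
use warnings;

# Stand-in for the slurped file contents (illustrative value only).
my $data1 = "contents of the file";

my $single = '$data1';    # single quotes: the literal characters $data1
my $double = "$data1";    # double quotes: the variable's value

print "single-quoted: $single\n";
print "double-quoted: $double\n";
```

Passing the variable itself, with no quotes at all, is the usual way to hand its contents to a method.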

The fix is to add some content to your new objects:

my $data2 = Text::Document->new() ;
$data2->AddContent($data1);

...and the same for the prior object.

In addition you can remove these lines:

my $data3 = $data2->WriteToString() ;
my $data4 = Text::Document::NewFromString($data3) ;

They are redundant. You are just recreating the same (empty) objects.
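Putting those two changes together, a minimal sketch might look like the following. It assumes Text::Document is installed; doc_from_text is a hypothetical helper introduced here for illustration, and the two sample strings stand in for the slurped file contents.

```perl
use strict;
use warnings;
use Text::Document;

# Hypothetical helper: build a Text::Document and load text into it.
sub doc_from_text {
    my ($text) = @_;
    my $doc = Text::Document->new();
    $doc->AddContent($text);    # this is the step missing from the question
    return $doc;
}

# Sample strings standing in for the current and prior-year file contents.
my $current = doc_from_text('annual report revenue increased this year');
my $prior   = doc_from_text('annual report revenue decreased last year');

# No WriteToString/NewFromString round trip is needed before comparing.
my $sim = $current->CosineSimilarity($prior);
print "The cosine similarity score is $sim\n";
```

In the original loop, the same helper could be called with $data1 and $data1_prior in place of the sample strings.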

Upvotes: 3

Tanktalus

Reputation: 22294

Taking a peek at the source, I see that CosineSimilarity has this nugget:

    if( ($nD==0) || ($nE==0) ){
            return undef;
    } else {
            return $dotProduct / $nD / $nE;
    }

It will return undef rather than blowing up with a divide-by-zero error. (While it's nice to handle your errors, sometimes the error handling makes it less obvious that an error has occurred. I think yours is one such case: once you know about it, checking for undef is the obvious thing to do, but if you had got a divide-by-zero exception, you likely would have looked at things differently.)

Anyway, $nD and $nE are both determined through the EuclideanNorm method, called on $d ($self) and $e respectively. As a next debugging step, you should probably try printing those values. I'm going to guess that your $data4_prior will come out as 0, but it really could be either one. Without your actual data I can't check, so hopefully this gives you a good starting point for further debugging.
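That debugging step could be sketched roughly as below. It assumes Text::Document is installed; the two documents are placeholders (one deliberately left empty to reproduce the symptom), not the asker's data.

```perl
use strict;
use warnings;
use Text::Document;

# Placeholder documents; in the question these would come from the two files.
my $current = Text::Document->new();
$current->AddContent('revenue increased this year');
my $prior = Text::Document->new();    # deliberately left without content

# A norm of 0 means the document holds no terms.
my $norm_current = $current->EuclideanNorm();
my $norm_prior   = $prior->EuclideanNorm();
print "current norm: $norm_current\n";
print "prior norm:   $norm_prior\n";

# Check for undef before interpolating the score into a string.
my $sim = $current->CosineSimilarity($prior);
if (defined $sim) {
    print "The cosine similarity score is $sim\n";
} else {
    warn "CosineSimilarity returned undef: at least one document is empty\n";
}
```

Whichever norm prints as 0 points to the object that never received any content.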

Upvotes: 2
