nmh
nmh

Reputation: 501

Amend perl script so that words are matched on a word for word basis

I have been using this perl script (thanks to Jeff Schaller) to match 3 or more words in the title fields of two separate csv files. Original question here:

https://unix.stackexchange.com/questions/283942/matching-3-or-more-words-from-fields-in-separate-csv-files?noredirect=1#comment494461_283942

I have also added some exception functionality following advice from meuh:

#!/bin/perl

my @csv2 = ();
open CSV2, "<csv2" or die;
@csv2=<CSV2>;
close CSV2;

my %csv2hash = ();
for (@csv2) {
  chomp;
  my ($title) = $_ =~ /^.+?,\s*([^,]+?),/; #/ match the title 
  $csv2hash{$_} = $title;
}

open CSV1, "<csv1" or die;
while (<CSV1>) {
  chomp;
  my ($title) = $_ =~ /^.+?,\s*([^,]+?),/; #/ match the title 
  my @titlewords = split /\s+/, $title;    #/ get words

  my @new;                          #add exception words which shouldn't be matched
  foreach my $t (@titlewords){
  push(@new, $t) if $t !~ /^(and|if|where)$/i;
  }
  @titlewords = @new;
  my $desired = 3;
  my $matched = 0;
  foreach my $csv2 (keys %csv2hash) {
    my $count = 0;
    my $value = $csv2hash{$csv2};
    foreach my $word (@titlewords) {
      ++$count if $value =~ /\b$word\b/i;
      last if $count >= $desired;
    }
    if ($count >= $desired) {
      print "$csv2\n";
      ++$matched;
    }
  }
  print "$_\n" if $matched;
}
close CSV1;

During my testing, one issue I've found that I would like to tweak is that if csv2 contains a single common word such as the, if this is replicated in csv1 three or more times then three positive matches is found. To clarify:

If csv1 contains:

1216454,the important people feel the same way as the others, 15445454, 45445645

^ i.e. there are three insatnces of the in the above line

If csv2 contains:

14564564,the tallest man on earth,546456,47878787

^ i.e. there is one instance of the in this line

Then I would like only one word to be classed as matching, and there be no output (based on my desired number of matching words- 3 ) because there is only one instance of the matching word in one of the files.

However if:

csv1 contained:

1216454,the important people feel the same way as the others,15445454, 45445645

and csv2 contained:

15456456,the only way the man can sing the blues,444545,454545

Then, as there are three matching words in each (i.e. 3 instances of the word the in each title, then I would like this to be classed as a matching title based on my desired number of matching words being 3 or more, thus generating the output:

1216454,the important people feel the same way as the others,15445454, 45445645
15456456,the only way the man can sing the blues,444545,454545

I would like to amend the script so that if there is one instance of a word in a csv, and multiple instances of the same word in the other csv then that is classed as only one match. However, if there were say 3 instance of the word the in both files, then it should still be classed as three matches. Basically I would like matches to be on a word for word basis. Everything about the script other than this is perfect so I would rather not go back to the drawing board completely as I am happy with everything other than this. I hope I've explained it ok, if anyone need any clarification let me know.

Upvotes: 2

Views: 34

Answers (1)

terdon
terdon

Reputation: 3380

If you just wan to count unique matches, you can use a hash instead of a list to collect the words from csv1, just like you do for csv2, and then also count the occurrences of each word separately:

#!/usr/bin/env perl

my @csv2 = ();
open CSV2, "<csv2" or die;
@csv2=<CSV2>;
close CSV2;

my %csv2hash = ();
for (@csv2) {
  chomp;
  my ($title) = $_ =~ /^.+?,\s*([^,]+?),/; #/ match the title 
  $csv2hash{$_} = $title;
}

open CSV1, "<csv1" or die;
while (<CSV1>) {
  chomp;
  my ($title) = $_ =~ /^.+?,\s*([^,]+?),/; #/ match the title 
    my %words;
    $words{$_}++ for split /\s+/, $title;    #/ get words
    ## Collect unique words
    my @titlewords = keys(%words);
  my @new;                          #add exception words which shouldn't be matched
  foreach my $t (@titlewords){
        push(@new, $t) if $t !~ /^(and|if|where)$/i;
  }
  @titlewords = @new;
  my $desired = 3;
  my $matched = 0;
  foreach my $csv2 (keys %csv2hash) {
    my $count = 0;
    my $value = $csv2hash{$csv2};
    foreach my $word (@titlewords) {
            my @matches   = ( $value=~/\b$word\b/ig );
            my $numIncsv2 = scalar(@matches);
            @matches      = ( $title=~/\b$word\b/ig );
            my $numIncsv1 = scalar(@matches);
            ++$count if $value =~ /\b$word\b/i;
            if ($count >= $desired || ($numIncsv1 >= $desired && $numIncsv2 >= $desired)) {
                $count = $desired+1;
                last;
            }
    }
    if ($count >= $desired) {
      print "$csv2\n";
      ++$matched;
    }
  }
  print "$_\n" if $matched;
}
close CSV1;

Upvotes: 1

Related Questions