Get almost Duplicate Books from my Booklist so that the duplicates are adjacent To each other

Question

Previously, I asked the question Search For Three Consecutive Words to get almost similar books from a booklist. The idea was, if two strings have three similar consecutive words then they will be considered almost duplicate.

I got a good solution from there. The solution is given below.

I am using the following AWK Script (script.awk).

NR == FNR {
        gsub("[[:punct:]]", "")

        for (i = 3; i <= NF; ++i)
                w[$(i-2),$(i-1),$i]++

        next
}

{
        orig = $0
        gsub("[[:punct:]]", "")

        for (i = 3; i <= NF; ++i)
                if (w[$(i-2),$(i-1),$i] > 1) {
                        print orig
                        next
                }
}
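
For readers less familiar with awk, the same two-pass logic can be sketched in Python. This is only an illustration of the idea, not a replacement for the script: the function names are made up here, and punctuation stripping uses ASCII `string.punctuation`, which is an assumption about how `[[:punct:]]` behaves in the C locale. It also shows why related titles are not grouped: lines are printed in input order.

```python
import string
from collections import Counter

# Remove ASCII punctuation, roughly like gsub("[[:punct:]]", "") in the C locale.
_STRIP = str.maketrans("", "", string.punctuation)

def trigrams(line):
    """Return every run of three consecutive words in a line, punctuation removed."""
    words = line.translate(_STRIP).split()
    return [tuple(words[i:i + 3]) for i in range(len(words) - 2)]

def near_duplicates(lines):
    # Pass 1: count every three-word run over the whole input.
    counts = Counter(t for line in lines for t in trigrams(line))
    # Pass 2: keep a line if any of its runs was counted more than once.
    # Lines come out in input order, so related titles are not grouped.
    return [line for line in lines if any(counts[t] > 1 for t in trigrams(line))]

books = [
    "7L: The Seven Levels of Communication",
    "Lean Startup by Eric Ries",
    "Venture Deals by Brad Feld & Jason Mendelson",
    "The Seven Levels of Communication: Go From Relationships to Referrals by Michael J. Maher",
    "Venture Deals by Brad Feld",
]
result = near_duplicates(books)
print("\n".join(result))
```

With this small sample, the two Seven Levels titles and the two Venture Deals titles are all reported, but each pair is separated by other lines, which is the problem described below.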

The input data (TestData.txt) is -

$ cat TestData.txt 
7L: The Seven Levels of Communication
Numbers Guide: The Essentials of Business Numeracy by Richard Stutely
The MVP Machine: How Baseball's New Nonconformists Are Using Data to Build Better Players
Freakonomics: A Rogue Economist Explores the Hidden Side of Everything by Steven Levitt
Freakonomics: A Rogue Economist
Freakonomics: A Rogue Economist Explores the Hidden
Freakonomics: A Rogue Economist Explores the Hidden
Superfreakonomics by Steven Levitt
Moneyball by Michael Lewis
Money Master the Game by Tony Robbinson
Money Master the Game by Tony Robbinson
Money Master the Game by Tony Robbinson
Zen and the Art of Motorcycle Maintenance: An Inquiry into Values
Zen and the Art of Motorcycle Maintenance: An Inquiry into Values
Impossible to Inevitable by Jason Lemkin
How to Sell Your Way Through Life by Napoleon Hill
Venture Deals by Brad Feld & Jason Mendelson
Envisioning the Survey Interview of the Future
Brave Leadership: Unleash Your Most Confident, Powerful, and Authentic Self to Get the Results You Need
Dealers of Lightning: Xerox PARC and the Dawn of the Computer Age
The Seven Levels of Communication: Go From Relationships to Referrals by Michael J. Maher
How to Be a Power Connector: The 5+50+100 Rule for Turning Your Business Network into Profits by Judy Robinett
Lean Startup by Eric Ries
The E-Myth Revisited
The Power of Broke
The Four Steps to the Epiphany by Steve Blank
The Art of the Start
Growth Juice by John A. Weber
Man's Worldly Goods: The Story of the Wealth of Nations by Leo Huberman
The Wealth of Nations by Adam Smith
A History of Central Banking and the Enslavement of Mankind
A History of Money and Banking in the United States: The Colonial Era to World War II
The History of Banking: The History of Banking and How the World of Finance Became What it is Today
The Federal Reserve: What Everyone Needs to Know
The Federal Reserve and its Founders: Money, Politics, and Power
America's Bank: The Epic Struggle to Create the Federal Reserve
The Power and Independence of the Federal Reserve
America's Money Machine: The Story of the Federal Reserve
Too Big to Fail by Andrew Ross Sorkin
Business - Later - Read Review To Confirm
Blogging for Your Business
Liar’s Poker by Michael Lewis
Sensemaking: The Power of the Humanities in the Age of the Algorithm, by Christian Madsbjerg.
Giftology: The Art and Science of Using Gifts to Cut Through the Noise, Increase Referrals, and Strengthen Retention, by John Ruhlin.
Getting Real by the people at Basecamp
Venture Deals by Brad Feld

To get the duplicate books I am giving the command awk -f script.awk TestData.txt TestData.txt.

The output is -

$ awk -f script.awk TestData.txt TestData.txt 
7L: The Seven Levels of Communication
Freakonomics: A Rogue Economist Explores the Hidden Side of Everything by Steven Levitt
Superfreakonomics by Steven Levitt
Moneyball by Michael Lewis
Venture Deals by Brad Feld & Jason Mendelson
The Seven Levels of Communication: Go From Relationships to Referrals by Michael J. Maher
The Power of Broke
Man's Worldly Goods: The Story of the Wealth of Nations by Leo Huberman
The Wealth of Nations by Adam Smith
A History of Central Banking and the Enslavement of Mankind
A History of Money and Banking in the United States: The Colonial Era to World War II
The History of Banking: The History of Banking and How the World of Finance Became What it is Today
The Federal Reserve: What Everyone Needs to Know
The Federal Reserve and its Founders: Money, Politics, and Power
America's Bank: The Epic Struggle to Create the Federal Reserve
The Power and Independence of the Federal Reserve
America's Money Machine: The Story of the Federal Reserve
Liar’s Poker by Michael Lewis
Sensemaking: The Power of the Humanities in the Age of the Algorithm, by Christian Madsbjerg.
Venture Deals by Brad Feld

However, I have a little problem. The Problem is -

Here,

7L: The Seven Levels of Communication AND The Seven Levels of Communication: Go From Relationships to Referrals by Michael J. Maher

are almost duplicate and should be together.

Again,

Moneyball by Michael Lewis AND Liar’s Poker by Michael Lewis

should be together.

Once More,

Venture Deals by Brad Feld & Jason Mendelson AND Venture Deals by Brad Feld

should be together. But they are not. You get the idea :)

Update:

As you may notice, I made a small change to the input. Money Master the Game by Tony Robbinson is there three times. Zen and the Art of Motorcycle Maintenance: An Inquiry into Values is there two times. I put Freakonomics in four times in total, as Full Or Partial Duplicates.

Freakonomics: A Rogue Economist Explores the Hidden Side of Everything by Steven Levitt
Freakonomics: A Rogue Economist
Freakonomics: A Rogue Economist Explores the Hidden
Freakonomics: A Rogue Economist Explores the Hidden

Given the new input, my Expected Output is something like:

7L: The Seven Levels of Communication
The Seven Levels of Communication: Go From Relationships to Referrals by Michael J. Maher
Venture Deals by Brad Feld & Jason Mendelson
Venture Deals by Brad Feld
Freakonomics: A Rogue Economist Explores the Hidden Side of Everything by Steven Levitt
Freakonomics: A Rogue Economist
Freakonomics: A Rogue Economist Explores the Hidden
Freakonomics: A Rogue Economist Explores the Hidden
Moneyball by Michael Lewis
Liar’s Poker by Michael Lewis
Money Master the Game by Tony Robbinson
Money Master the Game by Tony Robbinson
Money Master the Game by Tony Robbinson
Zen and the Art of Motorcycle Maintenance: An Inquiry into Values
Zen and the Art of Motorcycle Maintenance: An Inquiry into Values

NOTE: I might have missed something as I created the output manually. The output does not need to be the same. It is just to give a rough idea.

Explanation of the Expected Output:

Full Or Partial Duplicate books need to be adjacent to each other. The output should show only Full Or Partial Duplicates, and each one should be shown every time it occurs.

For example, Money Master the Game by Tony Robbinson appears three times so it should be shown three times. Zen and the Art of Motorcycle Maintenance: An Inquiry into Values appears two times so it should be shown two times.

Again, there are both Full and Partial Duplicates of Freakonomics, so all of them should appear.

Freakonomics: A Rogue Economist Explores the Hidden Side of Everything by Steven Levitt
Freakonomics: A Rogue Economist
Freakonomics: A Rogue Economist Explores the Hidden
Freakonomics: A Rogue Economist Explores the Hidden

Books that do not have any Full Or Partial Duplicates should not appear in the output. For example, books like Blogging for Your Business and Numbers Guide: The Essentials of Business Numeracy by Richard Stutely should not be in the output, as they do not have any Full Or Partial Duplicates.

cas · Accepted Answer

IMO, this task is better solved using the intersections of sets of words, rather than looking for 3 consecutive words.

Accordingly, the following perl script does not look for 3 consecutive words. Instead, it first reads in the entire input (from stdin and/or one or more files) and (using the Set::Tiny module) creates a set of words for each input line.

Then it processes the input a second time, and (for each line) it prints out any lines read in the first pass which have exact duplicates or where the intersection of sets has 3 or more elements.

It uses a hash array called %sets to store the word sets for each title, and another hash called %titles to count the number of times it has seen each title - this is used in the output phase to ensure it never prints any title more often than it was seen in the input.

In short, it prints duplicate lines and similar lines (i.e. those which have at least 3 of the same words in them) next to each other - the 3 words do not have to be consecutive.

The script ignores several very common small words when constructing the sets, but this can be disabled by commenting out or deleting the line with the OPTIONAL... comment. Or you can edit the common word list to suit your needs.
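
The approach described above (one word set per title, a threshold of three shared words, per-title counts, and small-word filtering) can be sketched in Python for illustration. This is a rough port under stated assumptions, not the Perl script itself: `word_set`, `group_similar`, and the shortened `STOP_WORDS` list are names invented here, and stripping the typographic apostrophe is an extra assumption.

```python
import string
from collections import Counter

# Shortened small-word list for illustration; the Perl script's list is longer.
STOP_WORDS = {"a", "an", "and", "by", "for", "of", "the", "to", "your"}
# ASCII punctuation plus the typographic apostrophe (assumption, see lead-in).
_STRIP = str.maketrans("", "", string.punctuation + "\u2019")

def word_set(title):
    """Lowercase, drop punctuation and small words, return the remaining words as a set."""
    words = title.translate(_STRIP).lower().split()
    return {w for w in words if w not in STOP_WORDS}

def group_similar(lines, threshold=3):
    sets, remaining = {}, Counter()
    for line in lines:
        words = word_set(line)
        if words:                       # skip lines with no words left
            sets[line] = words
            remaining[line] += 1        # copies of this title not yet printed
    keys = sorted(sets)
    out = []
    for title in keys:
        if remaining[title] == 0:
            continue
        if remaining[title] > 1:        # exact duplicates: emit every copy once
            out += [title] * remaining[title]
            remaining[title] = 0
        for other in keys:
            if other == title or remaining[other] == 0:
                continue
            if len(sets[title] & sets[other]) >= threshold:
                if remaining[title] > 0:
                    out.append(title)
                out += [other] * remaining[other]
                remaining[other] = 0
                remaining[title] = 0
    return out

books = [
    "Venture Deals by Brad Feld & Jason Mendelson",
    "Moneyball by Michael Lewis",
    "Money Master the Game by Tony Robbinson",
    "Money Master the Game by Tony Robbinson",
    "Liar\u2019s Poker by Michael Lewis",
    "Blogging for Your Business",
    "Venture Deals by Brad Feld",
]
grouped = group_similar(books)
print("\n".join(grouped))
```

In this sample the two Michael Lewis titles share only two words once by is dropped (michael and lewis), so they fall below the threshold of three and are not printed.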

One thing worth mentioning is that the small words list in the script includes the word by. You can delete it from the list if you like, but the reason why it's there is to stop the script from matching on by plus any two other words - e.g. Aardvark Taxidermy for Personal Wealth by Peter Smith would match The Wealth of Nations by Adam Smith (matches on by, Wealth, and Smith). The first book is (I hope) entirely non-existent but if it did exist, it would not be at all related to an economics text.
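
The effect of keeping or dropping by can be checked with a few lines of Python. This is a minimal sketch, assuming simple ASCII punctuation stripping and lowercasing; `shared_words` is a helper invented for this example.

```python
import string

_STRIP = str.maketrans("", "", string.punctuation)

def shared_words(a, b, stop_words=frozenset()):
    """Return the words two titles have in common, ignoring case, punctuation and stop words."""
    def words(title):
        return {w for w in title.translate(_STRIP).lower().split() if w not in stop_words}
    return words(a) & words(b)

fake = "Aardvark Taxidermy for Personal Wealth by Peter Smith"
real = "The Wealth of Nations by Adam Smith"

with_by = shared_words(fake, real)                        # "by" counts as a word
without_by = shared_words(fake, real, stop_words={"by"})  # "by" is ignored

print(sorted(with_by))     # ['by', 'smith', 'wealth']
print(sorted(without_by))  # ['smith', 'wealth']
```

With by counted, the two titles reach the threshold of three shared words. With by ignored, they share only two and are no longer treated as similar.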

Note: this script stores the entire input, and the associated word sets for each input line, in memory. This is unlikely to be a problem for modern systems with a few GiB of free RAM unless the input is extremely large.

Note2: Set::Tiny is packaged for Debian as libset-tiny-perl. It may be available pre-packaged for other distributions too. Otherwise, you can get it from the CPAN link above.

#!/usr/bin/perl -w

use strict;
use Set::Tiny;

# a partial list of common articles, prepositions and small words joined into
# a regex.
my $sw = join("|", qw(
  a about after against all among an and around as at be before between both
  but by can do down during first for from go have he her him how
  I if in into is it its last like me my new of off old
  on or out over she so such that the their there they this through to
  too under up we what when where with without you your)
);

my %sets=();    # word sets for each title.
my %titles=();  # count of how many times we see the same title.

while(<>) {
  chomp;
  # take a copy of the original input line, so we can use it as
  # a key for the hashes later.
  my $orig = $_;

  # "simplify" the input line
  s/[[:punct:]]//g;  #/ strip punctuation characters
  s/^\s*|\s*$//g;    #/ strip leading and trailing spaces
  $_=lc;             #/ lowercase everything, case is not important.
  s/\b($sw)\b//iog;  #/ optional. strip small words
  next if (/^$/);

  $sets{$orig} = Set::Tiny->new(split);
  $titles{$orig}++;
};

my @keys = (sort keys %sets);

foreach my $title (@keys) {
  next unless ($titles{$title} > 0);

  # if we have any exact dupes, print them. and make sure they won't
  # be printed again.
  if ($titles{$title} > 1) {
    print "$title\n" x $titles{$title};
    $titles{$title}  = 0;
  };

  foreach my $key (@keys) {
    next unless ($titles{$key} > 0);
    next if ($key eq $title);

    my $intersect = $sets{$key}->intersection($sets{$title});
    my $k=scalar keys %{ $intersect };

    #print STDERR "====>$k(" . join(",",sort keys %{ $intersect }) . "):$title:$key\n" if ($k > 1);

    if ($k >= 3) {
      print "$title\n" if ($titles{$title} > 0);
      print "$key\n" x $titles{$key};
      $titles{$key}   = 0;
      $titles{$title} = 0;
    };
  };
};

Save it as, e.g. blueray.pl, and make it executable with chmod +x.

Given the new sample input, it produces the following output:

$ ./blueray.pl TestData.txt 
7L: The Seven Levels of Communication
The Seven Levels of Communication: Go From Relationships to Referrals by Michael J. Maher
A History of Money and Banking in the United States: The Colonial Era to World War II
The History of Banking: The History of Banking and How the World of Finance Became What it is Today
America's Bank: The Epic Struggle to Create the Federal Reserve
America's Money Machine: The Story of the Federal Reserve
Freakonomics: A Rogue Economist
Freakonomics: A Rogue Economist Explores the Hidden
Freakonomics: A Rogue Economist Explores the Hidden
Freakonomics: A Rogue Economist Explores the Hidden Side of Everything by Steven Levitt
Money Master the Game by Tony Robbinson
Money Master the Game by Tony Robbinson
Money Master the Game by Tony Robbinson
The Federal Reserve and its Founders: Money, Politics, and Power
The Power and Independence of the Federal Reserve
Venture Deals by Brad Feld
Venture Deals by Brad Feld & Jason Mendelson
Zen and the Art of Motorcycle Maintenance: An Inquiry into Values
Zen and the Art of Motorcycle Maintenance: An Inquiry into Values

This is not exactly the same as your example output. Because it checks for the presence of common words in titles while ignoring their exact order, it is more likely to find false positives and less likely to miss matches that it shouldn't (false negatives).

If you want to experiment with this or just see what words it is matching (or almost matching) on, you can uncomment the #print STDERR line.
