mccannf

Reputation: 16659

How to filter down a list of files to remove known duplicates

I have the following list of files:

INV_1400524_20170412_052945.pdf
INV_1400524_20170412_063522.pdf
INV_1400524_20170412_090338.pdf
INV_1400524_20170412_092911.pdf
INV_1400971_20170502_095250.pdf
INV_1401580_20170703_100410.pdf
INV_1401880_20170804_112917.pdf
RIN_1300355_20170503_014347.pdf
RIN_1300552_20170518_111143.pdf
RIN_1300552_20170518_122055.pdf
RIN_1300688_20170627_040340.pdf
RIN_1300834_20170727_113641.pdf
RIN_1300834_20170727_154404.pdf

which have the format:

<Document Type>_<Document Number>_<Date>_<Time>.pdf

As you can see, for some reason the same document number has been output multiple times. I want to ignore the duplicates and filter the list down to unique document numbers and the latest date. These documents also have a modified file timestamp that closely matches the date and time in the filename if that helps.

Using Perl (I have been using File::Find::Rule) I want to reduce the list down to:

INV_1400524_20170412_092911.pdf
INV_1400971_20170502_095250.pdf
INV_1401580_20170703_100410.pdf
INV_1401880_20170804_112917.pdf
RIN_1300355_20170503_014347.pdf
RIN_1300552_20170518_122055.pdf
RIN_1300688_20170627_040340.pdf
RIN_1300834_20170727_154404.pdf

I have started with

my @pdf_files = File::Find::Rule->new
    ->name( '*.pdf' )
    ->mtime( ">$days_ago" )
    ->in( $root_dir );

But looking at this answer: How can I find the newest .pl file in a directory and all its subdirectories using Perl?

I think there may be a way to use:

my $rule = File::Find::Rule->new;
$rule->or( $rule->new->name('INV_*.pdf')->....
$rule->or( $rule->new->name('RIN_*.pdf')->....
my @files = $rule->in($root_dir);

to group and filter them down. Any ideas?

Upvotes: 0

Views: 49

Answers (1)

Sobrique

Reputation: 53478

There's a nice idiom using grep:

my %seen; 
my @files = grep { not $seen{$_}++ } @files;

Because of the post-increment, the test is true the first time a key is seen and false every time after. You can also use a regex capture to de-duplicate on a substring, e.g. the document number:

#!/usr/bin/env perl

use strict;
use warnings;

use Data::Dumper;

chomp(
   my @files = <DATA>
);

my %seen;
@files = grep { m/(\d+)/ and not $seen{$1}++ } @files;

print Dumper \@files;

__DATA__
INV_1400524_20170412_052945.pdf
INV_1400524_20170412_063522.pdf
INV_1400524_20170412_090338.pdf
INV_1400524_20170412_092911.pdf
INV_1400971_20170502_095250.pdf
INV_1401580_20170703_100410.pdf
INV_1401880_20170804_112917.pdf
RIN_1300355_20170503_014347.pdf
RIN_1300552_20170518_111143.pdf
RIN_1300552_20170518_122055.pdf
RIN_1300688_20170627_040340.pdf
RIN_1300834_20170727_113641.pdf
RIN_1300834_20170727_154404.pdf

This outputs:

$VAR1 = [
          'INV_1400524_20170412_052945.pdf',
          'INV_1400971_20170502_095250.pdf',
          'INV_1401580_20170703_100410.pdf',
          'INV_1401880_20170804_112917.pdf',
          'RIN_1300355_20170503_014347.pdf',
          'RIN_1300552_20170518_111143.pdf',
          'RIN_1300688_20170627_040340.pdf',
          'RIN_1300834_20170727_113641.pdf'
        ];

If your criteria are more complicated, you may need to apply a sort first, so that the entry you want to keep is the 'first' one the grep sees.

There are two approaches. You can sort on the filename, and because the names embed an ISO-style date that will work. Note that the sort must be descending here: a plain ascending sort would keep the earliest name per document number, while you want the latest:

@files = grep { m/(\d+)/ and not $seen{$1}++ } reverse sort @files;
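Putting that together, here is a self-contained sketch that keeps the latest timestamp per document number (the `_(\d+)_` capture assumes the naming scheme shown in the question, and the final ascending sort is only for display):

```perl
#!/usr/bin/env perl
use strict;
use warnings;

my @files = qw(
    INV_1400524_20170412_052945.pdf
    INV_1400524_20170412_063522.pdf
    INV_1400524_20170412_090338.pdf
    INV_1400524_20170412_092911.pdf
    INV_1400971_20170502_095250.pdf
    INV_1401580_20170703_100410.pdf
    INV_1401880_20170804_112917.pdf
    RIN_1300355_20170503_014347.pdf
    RIN_1300552_20170518_111143.pdf
    RIN_1300552_20170518_122055.pdf
    RIN_1300688_20170627_040340.pdf
    RIN_1300834_20170727_113641.pdf
    RIN_1300834_20170727_154404.pdf
);

# Sort descending so the newest timestamp for each document
# number comes first, then keep only the first per number.
my %seen;
my @latest = grep { m/_(\d+)_/ and not $seen{$1}++ }
             sort { $b cmp $a } @files;

# Restore ascending order for display.
@latest = sort @latest;

print "$_\n" for @latest;
```

This prints exactly the eight filenames asked for in the question, one per line.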

Or you can sort based on a stat of each file (for this you'll need the full file path, so be warned!):

@files = grep { m/(\d+)/ and not $seen{$1}++} sort { -M $a <=> -M $b } @files;

-M is the Perl file test that returns a file's age in days, measured relative to the script's start time. Sorting it ascending puts the newest file first.

You could use stat directly instead.
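If the timestamps in the filenames can't be trusted, you can sort on the real modification time from stat (field 9) with a Schwartzian transform, so each file is only stat'ed once. The temp-file setup below exists only to make the sketch self-contained and runnable:

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Create two throw-away files with known mtimes so the example
# runs anywhere; utime sets atime and mtime in epoch seconds.
my $dir   = tempdir( CLEANUP => 1 );
my @paths = ( "$dir/old.pdf", "$dir/new.pdf" );
for my $p (@paths) {
    open my $fh, '>', $p or die "open $p: $!";
    close $fh;
}
utime 1000, 1000, $paths[0];    # older file
utime 2000, 2000, $paths[1];    # newer file

# Schwartzian transform: stat each path once, sort newest
# first on mtime (stat field 9), then discard the sort keys.
my @newest_first = map  { $_->[0] }
                   sort { $b->[1] <=> $a->[1] }
                   map  { [ $_, (stat $_)[9] ] }
                   @paths;

print "$_\n" for @newest_first;
```

The same `grep { not $seen{...}++ }` filter can then be applied to `@newest_first` to keep the newest file per document number.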

Upvotes: 1
