Reputation: 16659
I have the following list of files:
INV_1400524_20170412_052945.pdf
INV_1400524_20170412_063522.pdf
INV_1400524_20170412_090338.pdf
INV_1400524_20170412_092911.pdf
INV_1400971_20170502_095250.pdf
INV_1401580_20170703_100410.pdf
INV_1401880_20170804_112917.pdf
RIN_1300355_20170503_014347.pdf
RIN_1300552_20170518_111143.pdf
RIN_1300552_20170518_122055.pdf
RIN_1300688_20170627_040340.pdf
RIN_1300834_20170727_113641.pdf
RIN_1300834_20170727_154404.pdf
which have the format:
<Document Type>_<Document Number>_<Date>_<Time>.pdf
As you can see, for some reason the same document number has been output multiple times. I want to ignore the duplicates and filter the list down to unique document numbers and the latest date. These documents also have a modified file timestamp that closely matches the date and time in the filename if that helps.
Using Perl (I have been using File::Find::Rule), I want to reduce the list down to:
INV_1400524_20170412_092911.pdf
INV_1400971_20170502_095250.pdf
INV_1401580_20170703_100410.pdf
INV_1401880_20170804_112917.pdf
RIN_1300355_20170503_014347.pdf
RIN_1300552_20170518_122055.pdf
RIN_1300688_20170627_040340.pdf
RIN_1300834_20170727_154404.pdf
I have started with
my @pdf_files = File::Find::Rule->new
    ->name('*.pdf')
    ->mtime(">$days_ago")   # compared against the file's epoch mtime
    ->in($root_dir);        # in() runs the search, so it has to come last
But looking at this answer: How can I find the newest .pl file in a directory and all its subdirectories using Perl?
I think there may be a way to use:
my $rule = File::Find::Rule->new;
$rule->or( $rule->new->name('INV_*.pdf')->....
$rule->or( $rule->new->name('RIN_*.pdf')->....
my @files = $rule->in($root_dir);
to group and filter them down. Any ideas?
Upvotes: 0
Views: 49
Reputation: 53478
There's a nice idiom using grep:
my %seen;
my @unique = grep { not $seen{$_}++ } @files;
Because the counter is post-incremented, the test is true the first time a value is seen and false every time after that. You can also use a regex to key on a substring instead, e.g. the document ID:
#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;
chomp( my @files = <DATA> );
my %seen;
@files = grep { m/(\d+)/ and not $seen{$1}++ } @files;
print Dumper \@files;
__DATA__
INV_1400524_20170412_052945.pdf
INV_1400524_20170412_063522.pdf
INV_1400524_20170412_090338.pdf
INV_1400524_20170412_092911.pdf
INV_1400971_20170502_095250.pdf
INV_1401580_20170703_100410.pdf
INV_1401880_20170804_112917.pdf
RIN_1300355_20170503_014347.pdf
RIN_1300552_20170518_111143.pdf
RIN_1300552_20170518_122055.pdf
RIN_1300688_20170627_040340.pdf
RIN_1300834_20170727_113641.pdf
RIN_1300834_20170727_154404.pdf
This outputs:
$VAR1 = [
'INV_1400524_20170412_052945.pdf',
'INV_1400971_20170502_095250.pdf',
'INV_1401580_20170703_100410.pdf',
'INV_1401880_20170804_112917.pdf',
'RIN_1300355_20170503_014347.pdf',
'RIN_1300552_20170518_111143.pdf',
'RIN_1300688_20170627_040340.pdf',
'RIN_1300834_20170727_113641.pdf'
];
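One caveat with m/(\d+)/: it keys on the first run of digits only, so an INV and a RIN file that happened to share a number would be treated as duplicates of each other. Here is a minimal sketch that keys on type and number together, assuming the names always follow <Document Type>_<Document Number>_<Date>_<Time>.pdf. The doc_key helper and the sample names are made up for illustration:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: returns "TYPE_NUMBER" for a well-formed name,
# or undef for anything that does not fit the expected pattern.
sub doc_key {
    my ($name) = @_;
    return $name =~ m/^([A-Za-z]+_\d+)_\d{8}_\d{6}\.pdf$/ ? $1 : undef;
}

# Made-up sample: INV and RIN deliberately share the number 1400524.
my @names = qw(
    INV_1400524_20170412_052945.pdf
    INV_1400524_20170412_063522.pdf
    RIN_1400524_20170503_014347.pdf
    notes.txt
);

# Same "seen" idiom as above, but keyed on type plus number.
my %seen_key;
my @first_per_doc = grep {
    my $key = doc_key($_);
    defined $key and not $seen_key{$key}++;
} @names;

print "$_\n" for @first_per_doc;
```

With the looser m/(\d+)/ the RIN file in this sample would have been dropped. Names that do not match the pattern (notes.txt here) are skipped rather than grouped together.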
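Note that this keeps the first file seen for each document number, which here is the earliest one. If your criteria are more complicated (such as wanting the latest), you need to sort first so that the file you want to keep comes first for each document number.
There are two approaches. You can sort on the filename: because the date and time are in ISO order (YYYYMMDD_HHMMSS), a plain string sort is also chronological within each document number. Reverse it so the latest comes first, and start with an empty %seen (the result comes out in descending order):
@files = grep { m/(\d+)/ and not $seen{$1}++ } reverse sort @files;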
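Or you can sort on the file's modification time, which costs a stat syscall per file. You'll need the full file path for this, so be warned: m/(\d+)/ will then match the first digits anywhere in the path, including the directory part.
@files = grep { m/(\d+)/ and not $seen{$1}++ } sort { -M $a <=> -M $b } @files;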
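-M is the Perl file test that returns the age of a file in days (script start time minus modification time), so an ascending sort puts the newest file first. You could use stat instead, though.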
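As a rough sketch of the stat variant: element 9 of the list that stat returns is the modification time in epoch seconds, so sorting on it in descending order puts the newest file first. To keep the example self-contained it creates a few empty files in a temporary directory and sets their mtimes with utime. The file names and epoch values are made up, and the regex is anchored to the end of the path so digits in the directory name are not picked up by mistake:

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use File::Spec;

# Throwaway directory with made-up files and arbitrary epoch mtimes.
my $dir = tempdir( CLEANUP => 1 );
my %fake_mtime = (
    'INV_1400524_20170412_052945.pdf' => 1_000_000_000,
    'INV_1400524_20170412_092911.pdf' => 1_000_003_600,
    'RIN_1300355_20170503_014347.pdf' => 1_000_007_200,
);

my ( @paths, %mtime );
for my $name ( sort keys %fake_mtime ) {
    my $path = File::Spec->catfile( $dir, $name );
    open my $fh, '>', $path or die "Cannot create $path: $!";
    close $fh or die "Cannot close $path: $!";
    utime $fake_mtime{$name}, $fake_mtime{$name}, $path
        or die "Cannot set mtime on $path: $!";
    push @paths, $path;
    $mtime{$path} = ( stat $path )[9];    # one stat call per file
}

# Newest first, then keep the first path seen per document number.
my @newest_first = sort { $mtime{$b} <=> $mtime{$a} } @paths;
my %seen_doc;
my @latest = grep { m/_(\d+)_\d{8}_\d{6}\.pdf$/ and not $seen_doc{$1}++ }
    @newest_first;

print "$_\n" for @latest;
```

Caching the mtimes in a hash avoids calling stat repeatedly inside the sort comparator. With real data, @paths would come from File::Find::Rule's in() instead of the temporary directory.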
Upvotes: 1