yvf5rcuya4
yvf5rcuya4

Reputation: 59

Creating a list of duplicate filenames with Perl

I've been trying to write a script to pre-process some long lists of files, but I am not confident (nor competent) with Perl yet and am not getting the results I want.

The script below is very much work in progress but I'm stuck on the check for duplicates and would be grateful if anyone could let me know where I am going wrong. The block dealing with duplicates seems to be of the same form as examples I have found but it doesn't seem to work.

#!/usr/bin/perl
use strict;
use warnings;

open my $fh, '<', $ARGV[0] or die "can't open: $!";

foreach my $line (<$fh>) {

#   Trim list to remove directories which do not need to be checked
    next if $line =~ m/Inventory/;
#   MORE TO DO 
    next if $line =~ m/Scanned photos/;

    $line =~ s/\n//; # just for a tidy list when testing
    my @split = split(/\/([^\/]+)$/, $line); # separate filename from rest of path

    foreach (@split) {
        push (my @filenames, "$_");
#       print "@filenames\n"; # check content of array

        my %dupes;

        foreach my $item (@filenames) {
            next unless $dupes{$item}++;
            print "$item\n";
        }
    } 
}

I am struggling to understand what is wrong with my check for duplicates. I know the array contains duplicates (uncommenting the first print function gives me a list with lots of duplicates). The code as it stands generates nothing.

Not the main purpose of my post but my final aim is to remove unique filenames from the list and keep filenames which are in duplicated in other directories. I know that none of these files are identical but many are different versions of the same file which is why I'm focussing on filename.

Eg I would want an input of:

~/Pictures/2010/12345678.jpg ~/Pictures/2010/12341234.jpg ~/Desktop/temp/12345678.jpg

to give an output of:

~/Pictures/2010/12345678.jpg ~/Desktop/temp/12345678.jpg

So I suppose ideally it would be good to check for uniqueness of a match based on the regex without splitting if that is possible.

Upvotes: 3

Views: 316

Answers (2)

user1919238
user1919238

Reputation:

TLP's answer gives lots of good advice. In addition:

Why use both an array and a hash to store the filenames? Simply use the hash as your one storage solution, and you will automatically remove duplicates. i.e:

my %filenames; #outside of the loops

...

foreach (@split) {
    $filenames{$_}++;
}

Now when you want to get the list of unique filenames, just use keys %filenames or, if you want them in alphabetical order, sort keys %filenames. And the value for each hash key is a count of occurrences, so you can find out which ones were duplicated if you care.

Upvotes: 3

TLP
TLP

Reputation: 67900

This below loop does nothing, because the hash and the array only contain one value for each loop iteration:

foreach (@split) {
    push (my @filenames, "$_");        # add one element to lexical array
    my %dupes;
    foreach my $item (@filenames) {    # loop one time
        next unless $dupes{$item}++;   # add one key to lexical hash
        print "$item\n";
    }
}                                      # @filenames and %dupes goes out of scope

A lexical variable (declared with my) has a scope that extends to the surrounding block { ... }, in this case your foreach loop. When they go out of scope, they are reset and all the data is lost.

I don't know why you copy the file names from @split to @filenames, it seems very redundant. The way to dedupe this would be:

my %seen;
my @uniq;

@uniq = grep !$seen{$_}++, @split;

Additional information:

You might also be interested in using File::Basename to get the file name:

use File::Basename;
my $fullpath = "~/Pictures/2010/12345678.jpg";
my $name = basename($fullpath);                  # 12345678.jpg

Your substitution

$line =~ s/\n//;

Should probably be

chomp($line);

When you read from a file handle, using for (foreach) means you read all the lines and store them in memory. It is preferable most times to instead use while, like this:

while (my $line = <$fh>)

Upvotes: 5

Related Questions