Tristan Forward
Tristan Forward

Reputation: 3514

Find duplicate filenames, and only keep newest file using python

I have +20 000 files, that look like this below, all in the same directory:

8003825.pdf
8003825.tif
8006826.tif

How does one find all duplicate filenames, while ignoring the file extension.

Clarification: I refer to a duplicate being a file with the same filename while ignoring the file extension. I do not care if the file is not 100% the same (ex. hashsize or anything like that)

For example:

"8003825" appears twice

Then look at the metadata of each duplicate file and only keep the newest one.

Similar to this post:

Keep latest file and delete all other

I think I have to create a list of all files, check if file already exists. If so then use os.stat to determine the modification date?

I'm a little concerned about loading all those filename's into memory. And wondering if there is a more pythonic way of doing things...

Python 2.6 Windows 7

Upvotes: 3

Views: 7034

Answers (3)

ovgolovin
ovgolovin

Reputation: 13410

You can do it with O(n) complexity. The solutions with sort have O(n*log(n)) complexity.

import os
from collections import namedtuple

directory = #file directory
os.chdir(directory)

newest_files = {}
Entry = namedtuple('Entry',['date','file_name'])

for file_name in os.listdir(directory):
    name,ext = os.path.splitext(file_name)
    cashed_file = newest_files.get(name)
    this_file_date = os.path.getmtime(file_name)
    if cashed_file is None:
        newest_files[name] = Entry(this_file_date,file_name)
    else:
        if this_file_date > cashed_file.date: #replace with the newer one
            newest_files[name] = Entry(this_file_date,file_name)

newest_files is a dictonary having file names without extensions as keys with values of named tuples which hold file full file name and modification date. If the new file that is encountered is inside the dictionary, its date is compared to the stored in the dictionary one and it is replaced if necessary.

In the end you have a dictionary with the most recent files.

Then you may use this list to perform the second pass. Note, that lookup complexity in the dictionary is O(1). So the overall complexity of looking all n files in the dictionary is O(n).

For example, if you want to leave only the newest files with the same name and delete the other, this can be achieved in the following way:

for file_name in os.listdir(directory):
    name,ext = os.path.splitext(file_name)
    cashed_file_name = newest_files.get(name).file_name
    if file_name != cashed_file_name: #it's not the newest with this name
        os.remove(file_name)

As suggested by Blckknght in the comments, you can even avoid the second pass and delete the older file as soon as you encounter the newer one, just by adding one line of the code:

    else:
        if this_file_date > cashed_file.date: #replace with the newer one
            newest_files[name] = Entry(this_file_date,file_name)
            os.remove(cashed_file.file_name) #this line added

Upvotes: 8

Andrew Clark
Andrew Clark

Reputation: 208455

First, get a list of file names and sort them. This will put any duplicates next to each other.

Then, strip off the file extension and compare to neighbors, os.path.splitext() and itertools.groupby() may be useful here.

Once you have grouped the duplicates, pick the one you want to keep using os.stat().

In the end your code might looks something like this:

import os, itertools

files = os.listdir(base_directory)
files.sort()
for k, g in itertools.groupby(files, lambda f: os.path.splitext(f)[0]):
     dups = list(g)
     if len(dups) > 1:
         # figure out which file(s) to remove

You shouldn't have to worry about memory here, you're looking at something on the order of a couple of megabytes.

Upvotes: 2

Nathan Villaescusa
Nathan Villaescusa

Reputation: 17629

For the filename counter you could use a defaultdict that stores how many times each file appears:

import os
from collections import defaultdict

counter = defaultdict(int)
for file_name in file_names:
   file_name = os.path.splitext(os.path.basename(file_name))[0]
   counter[file_name] += 1

Upvotes: 0

Related Questions