Reputation: 3514
I have 20,000+ files that look like the examples below, all in the same directory:
8003825.pdf
8003825.tif
8006826.tif
How does one find all duplicate filenames while ignoring the file extension?
Clarification: by a duplicate I mean a file with the same filename once the extension is ignored. I do not care whether the files are 100% identical (e.g. hash, size or anything like that).
For example:
"8003825" appears twice
Then look at the metadata of each duplicate file and only keep the newest one.
Similar to this post:
Keep latest file and delete all other
I think I have to create a list of all files and check whether each filename already exists; if it does, use os.stat to determine the modification date?
I'm a little concerned about loading all those filenames into memory, and I'm wondering if there is a more Pythonic way of doing things...
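Roughly, this untested sketch is the kind of thing I had in mind (dir_path is just a placeholder for the real directory):
import os

dir_path = r'C:\my\files'  # placeholder: the directory with the 20,000+ files

newest = {}  # base name (extension stripped) -> (mtime, full file name)
for fname in os.listdir(dir_path):
    base = os.path.splitext(fname)[0]
    mtime = os.stat(os.path.join(dir_path, fname)).st_mtime
    if base not in newest or mtime > newest[base][0]:
        newest[base] = (mtime, fname)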
Python 2.6 Windows 7
Upvotes: 3
Views: 7034
Reputation: 13410
You can do it with O(n) complexity. The solutions that sort have O(n*log(n)) complexity.
import os
from collections import namedtuple

directory = r'C:\path\to\files'  # file directory (set this to your directory)
os.chdir(directory)

newest_files = {}
Entry = namedtuple('Entry', ['date', 'file_name'])

for file_name in os.listdir(directory):
    name, ext = os.path.splitext(file_name)
    cashed_file = newest_files.get(name)
    this_file_date = os.path.getmtime(file_name)
    if cashed_file is None:
        newest_files[name] = Entry(this_file_date, file_name)
    else:
        if this_file_date > cashed_file.date:  # replace with the newer one
            newest_files[name] = Entry(this_file_date, file_name)
newest_files is a dictionary whose keys are the file names without extensions and whose values are named tuples holding the full file name and the modification date. When a file that is encountered is already in the dictionary, its date is compared with the stored one and the entry is replaced if necessary.
In the end you have a dictionary with the most recent files.
Then you may use this dictionary to perform the second pass. Note that lookup in a dictionary is O(1), so the overall complexity of looking up all n files in the dictionary is O(n).
For example, if you want to keep only the newest file for each name and delete the others, this can be achieved in the following way:
for file_name in os.listdir(directory):
    name, ext = os.path.splitext(file_name)
    cashed_file_name = newest_files.get(name).file_name
    if file_name != cashed_file_name:  # it's not the newest file with this name
        os.remove(file_name)
As suggested by Blckknght in the comments, you can even avoid the second pass and delete the older file as soon as you encounter the newer one, just by adding one line of code:
    else:
        if this_file_date > cashed_file.date:  # replace with the newer one
            newest_files[name] = Entry(this_file_date, file_name)
            os.remove(cashed_file.file_name)  # this line added
Upvotes: 8
Reputation: 208455
First, get a list of file names and sort them. This will put any duplicates next to each other.
Then, strip off the file extension and compare to neighbors; os.path.splitext() and itertools.groupby() may be useful here.
Once you have grouped the duplicates, pick the one you want to keep using os.stat().
In the end your code might look something like this:
import os, itertools

files = os.listdir(base_directory)
files.sort()

for k, g in itertools.groupby(files, lambda f: os.path.splitext(f)[0]):
    dups = list(g)
    if len(dups) > 1:
        # figure out which file(s) to remove
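For instance, one way to fill in that removal step is to keep the file with the newest modification time and delete the rest; this is only a sketch, with base_directory as above:
import os, itertools

files = sorted(os.listdir(base_directory))
for k, g in itertools.groupby(files, lambda f: os.path.splitext(f)[0]):
    dups = list(g)
    if len(dups) > 1:
        # keep the most recently modified file, remove the others
        paths = [os.path.join(base_directory, d) for d in dups]
        newest = max(paths, key=os.path.getmtime)
        for p in paths:
            if p != newest:
                os.remove(p)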
You shouldn't have to worry about memory here; you're looking at something on the order of a couple of megabytes.
Upvotes: 2
Reputation: 17629
For the filename counter you could use a defaultdict that stores how many times each file appears:
import os
from collections import defaultdict

counter = defaultdict(int)
for file_name in file_names:  # file_names: your list of files, e.g. from os.listdir()
    file_name = os.path.splitext(os.path.basename(file_name))[0]
    counter[file_name] += 1
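You could then pull the duplicate base names out of the counter, for example:
duplicates = [name for name, count in counter.items() if count > 1]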
Upvotes: 0