Reputation: 75
I am trying to loop through the files in a directory, find duplicates, and delete them. I have 29,000 files in the directory, so a brute-force comparison will take more than a day.
I have filenames like the following:
"some_file_name" "some-file-name"
So one name has underscores and the other has dashes, and sometimes the two are two or three positions apart in the listing.
So how do I make my inner loop start at the outer loop's current position in the directory and check only the next 10 files?
Here is my brute force code:
import glob, os

os.chdir("C:/Dir/dir")
for file in glob.glob("*"):
    temp = file
    temp = temp.replace("-", " ")
    temp = temp.replace("_", " ")
    # How do I start this loop where file is currently at and continue for the next 10 files
    for file2 in glob.glob("*"):
        temp2 = file2
        temp2 = temp2.replace("-", " ")
        temp2 = temp2.replace("_", " ")
        if temp == temp2:
            os.remove(file2)
Upvotes: 2
Views: 154
Reputation: 5068
You could use a dictionary and put the "simple name" (without _ or -) as the key and all the real filenames as values:
import glob, os

def extendDictValue(dDict, sKey, uValue):
    if sKey in dDict:
        dDict[sKey].append(uValue)
    else:
        dDict[sKey] = [uValue]

os.chdir("C:/Dir/dir")
filenames_dict = {}
for filename in glob.glob("*"):
    simple_name = filename.replace("-", " ").replace("_", " ")
    extendDictValue(filenames_dict, simple_name, filename)

for simple_name, filenames in filenames_dict.items():
    if len(filenames) > 1:
        filenames.pop(0)
        for filename in filenames:
            os.remove(filename)
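The same grouping idea can be written a little shorter with collections.defaultdict. This is only a sketch: the helper names group_by_simple_name and find_duplicates are made up for illustration, and the demo runs in a temporary directory so nothing real is deleted.

```python
import os
import tempfile
from collections import defaultdict


def group_by_simple_name(filenames):
    """Map each normalized name to the list of real filenames sharing it."""
    groups = defaultdict(list)
    for filename in filenames:
        simple_name = filename.replace("-", " ").replace("_", " ")
        groups[simple_name].append(filename)
    return groups


def find_duplicates(filenames):
    """Return every filename except the first one seen in each group."""
    duplicates = []
    for names in group_by_simple_name(filenames).values():
        duplicates.extend(names[1:])
    return duplicates


# Demonstration in a throwaway directory (hypothetical file names).
with tempfile.TemporaryDirectory() as tmp:
    for name in ["some_file_name", "some-file-name", "other_file"]:
        open(os.path.join(tmp, name), "w").close()
    # Sorting makes it predictable which file of each group is kept.
    to_delete = find_duplicates(sorted(os.listdir(tmp)))
    for name in to_delete:
        os.remove(os.path.join(tmp, name))
    remaining = sorted(os.listdir(tmp))
```

Separating "find" from "delete" means you can print to_delete and inspect it before removing anything. Each file is visited once, so 29,000 files should not be a problem.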
Upvotes: 0
Reputation: 338118
From what I understand from your question, you want to delete similarly named files from a directory. I think your approach ("look at the next 10 filenames or so") is too imprecise and too complicated.
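For reference, the windowed loop you describe could be sketched roughly as below, using enumerate and a slice. It assumes the names are in a sorted list; normalize and windowed_duplicates are names I made up for this illustration.

```python
def normalize(name):
    """Treat dashes and underscores as the same character."""
    return name.replace("-", " ").replace("_", " ")


def windowed_duplicates(filenames, window=10):
    """Compare each name only with the `window` names that follow it."""
    duplicates = set()
    for i, name in enumerate(filenames):
        if name in duplicates:
            continue  # already marked as a duplicate, skip it as a reference
        # The slice starts right after the current position and is cut
        # off automatically at the end of the list.
        for other in filenames[i + 1 : i + 1 + window]:
            if normalize(other) == normalize(name):
                duplicates.add(other)
    return duplicates


names = sorted(["a_b", "a-b", "a-c", "x_y"])
found = windowed_duplicates(names)             # window of 10 covers the whole list
missed = windowed_duplicates(names, window=1)  # window too small to see the pair
```

The second call shows the weakness: "a-c" sorts between "a-b" and "a_b", so with a window of 1 the pair is never compared. Any fixed window has this problem once enough other names sort in between.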
The condition is: when both a file some_file_name and a file some-file-name exist, delete one of them.
This can be done very easily by building a collection of filenames and, for each entry, checking whether a filename with underscores instead of dashes also exists. If it does, delete it.
The following uses a set to do this, because sets have very good look-up characteristics, i.e. some_value in some_set is much faster than it would be with a list. It also avoids excessive file-exists checks (like calling os.path.isfile(file)), since we already know all files that exist from building the set.
import glob, os

filenames = {file for file in glob.glob(r"C:\Dir\dir\*")}
for file in filenames:
    delete_candidate = file.replace("-", "_")
    if delete_candidate != file and delete_candidate in filenames:
        os.remove(delete_candidate)
        print("deleted " + delete_candidate)
{x for x in iterable} is a set comprehension; it builds a set from any iterable of values and works just like a list comprehension.
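If you want to try the idea without touching real files, here is a self-contained sketch that runs in a temporary directory. It is a variation, not a drop-in replacement: underscore_twins is a hypothetical helper name, and it works on bare file names from os.listdir rather than on full glob paths.

```python
import os
import tempfile


def underscore_twins(names):
    """Return the all-underscore names that duplicate a dashed name."""
    names = set(names)  # set membership tests are O(1) on average
    twins = set()
    for name in names:
        candidate = name.replace("-", "_")
        if candidate != name and candidate in names:
            twins.add(candidate)
    return twins


# Demonstration with made-up file names in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["some-file-name", "some_file_name", "only_underscores"]:
        open(os.path.join(tmp, name), "w").close()
    # os.listdir returns bare names, so a dash in the directory path
    # itself is never rewritten by replace().
    doomed = underscore_twins(os.listdir(tmp))
    for name in doomed:
        os.remove(os.path.join(tmp, name))
    remaining = sorted(os.listdir(tmp))
```

Collecting the candidates in a set first also means each file is removed at most once, even if two differently dashed names happen to map to the same underscore name.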
Upvotes: 3