Reputation: 580
I wanted to use a simple script to collect all images below a given folder and compare them to find duplicates.
Why reinvent the wheel when the first step of the solution is already available under: Finding duplicate files and removing them
But it already fails at the first step: it visits every folder on the USB flash drive instead of only the folder I gave it. I stripped out all the hashing code and am only trying to get the list of files, but even that takes forever and visits every file on the USB drive.
    from __future__ import print_function   # py2 compatibility
    from collections import defaultdict
    import hashlib
    import os
    import sys

    folder_to_check = "D:\FileCompareTest"

    def check_for_duplicates(paths, hash=hashlib.sha1):
        hashes_by_size = defaultdict(list)  # dict of size_in_bytes: [full_path_to_file1, full_path_to_file2, ]
        hashes_on_1k = defaultdict(list)  # dict of (hash1k, size_in_bytes): [full_path_to_file1, full_path_to_file2, ]
        hashes_full = {}   # dict of full_file_hash: full_path_to_file_string

        for path in paths:
            for dirpath, dirnames, filenames in os.walk(path):
                # get all files that have the same size - they are the collision candidates
                for filename in filenames:
                    full_path = os.path.join(dirpath, filename)
                    try:
                        # if the target is a symlink (soft one), this will
                        # dereference it - change the value to the actual target file
                        full_path = os.path.realpath(full_path)
                        file_size = os.path.getsize(full_path)
                        hashes_by_size[file_size].append(full_path)
                    except (OSError,):
                        # not accessible (permissions, etc) - pass on
                        continue

    check_for_duplicates(folder_to_check)
Instead of getting the hashes_by_size dict within a few milliseconds, I either get stuck in what looks like an endless loop, or the program exits hours later having collected every file on the USB drive.
What is it that I do not get about os.walk()?
Upvotes: 0
Views: 252
Reputation: 2343
You should call it with a list of paths:
paths_to_check = []
paths_to_check.append(folder_to_check)
check_for_duplicates(paths_to_check)
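The way you are calling it, paths is a single string, so "for path in paths" iterates over its individual characters and os.walk() is run on each character instead of on the full path. One of those characters is the backslash, which on Windows refers to the root of the current drive; that would explain why the whole USB drive gets visited.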
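The difference can be checked without touching the USB drive. The following is only a minimal sketch, assuming Python 3; normalize_paths and files_by_size are hypothetical helper names that are not part of the original script, and the demonstration runs on a temporary directory instead of D:\FileCompareTest.

```python
import os
import tempfile
from collections import defaultdict


def normalize_paths(paths):
    """Hypothetical helper: wrap a single path (str or os.PathLike) in a list."""
    if isinstance(paths, (str, os.PathLike)):
        return [paths]
    return list(paths)


def files_by_size(paths):
    """Hypothetical helper: map size_in_bytes -> list of full paths under the given roots."""
    by_size = defaultdict(list)
    for root in normalize_paths(paths):
        for dirpath, _dirnames, filenames in os.walk(root):
            for filename in filenames:
                full_path = os.path.realpath(os.path.join(dirpath, filename))
                try:
                    by_size[os.path.getsize(full_path)].append(full_path)
                except OSError:
                    continue  # not accessible (permissions, etc.)
    return by_size


folder_to_check = r"D:\FileCompareTest"

# Iterating over a bare string yields single characters, not the path.
chars = list(folder_to_check)[:3]
# After normalizing, a loop over the result sees the whole path exactly once.
wrapped = normalize_paths(folder_to_check)

# Demonstration on a throwaway directory, so it runs on any machine.
with tempfile.TemporaryDirectory() as tmp:
    for name, data in [("a.jpg", b"12345"), ("b.jpg", b"12345"), ("c.jpg", b"1")]:
        with open(os.path.join(tmp, name), "wb") as fh:
            fh.write(data)
    sizes = files_by_size(tmp)  # a bare string is accepted here
    summary = {size: len(found) for size, found in sizes.items()}

print(chars)
print(wrapped)
print(summary)
```

Wrapping a lone string inside the function is one possible design choice; passing [folder_to_check] at the call site, as shown above, works just as well and leaves the function unchanged.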
Upvotes: 1