rioZg

Reputation: 580

Python's os.walk() visits all folders instead of only the given folder

I wanted to use a simple script to collect all images below a given folder and compare them to find duplicates.

Why reinvent the wheel when the first step of the solution already exists in: Finding duplicate files and removing them

But it already fails at the first step: it visits every folder on the USB flash drive. I stripped out all the hashing code and am now only trying to get the list of files, but even that takes forever and visits every file on the USB drive.

from __future__ import print_function   # py2 compatibility
from collections import defaultdict
import hashlib
import os
import sys


folder_to_check = r"D:\FileCompareTest"  # raw string so the backslash is never read as an escape

def check_for_duplicates(paths, hash=hashlib.sha1):
    hashes_by_size = defaultdict(list)  # dict of size_in_bytes: [full_path_to_file1, full_path_to_file2, ]
    hashes_on_1k = defaultdict(list)  # dict of (hash1k, size_in_bytes): [full_path_to_file1, full_path_to_file2, ]
    hashes_full = {}   # dict of full_file_hash: full_path_to_file_string

    for path in paths:
        for dirpath, dirnames, filenames in os.walk(path):
            # get all files that have the same size - they are the collision candidates
            for filename in filenames:
                full_path = os.path.join(dirpath, filename)
                try:
                    # if the target is a symlink (soft one), this will 
                    # dereference it - change the value to the actual target file
                    full_path = os.path.realpath(full_path)
                    file_size = os.path.getsize(full_path)
                    hashes_by_size[file_size].append(full_path)
                except (OSError,):
                    # not accessible (permissions, etc) - pass on
                    continue




check_for_duplicates(folder_to_check)

Instead of getting the hashes_by_size list in a couple of milliseconds, I either get stuck in what looks like an endless loop, or the program exits hours later after visiting all the files on the USB drive.

What is it that I do not get about os.walk()?

Upvotes: 0

Views: 252

Answers (1)

lllrnr101

Reputation: 2343

The function expects a list of paths, so you should call it like this:

paths_to_check = []
paths_to_check.append(folder_to_check)
check_for_duplicates(paths_to_check)

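The way you are calling it, paths is a plain string, so "for path in paths" iterates over its characters, and os.walk() is called once per character ("D", ":", "\", "F", ...) instead of once with the full path. On Windows, the single character "\" refers to the root of the current drive, which would explain why every file on the USB drive gets visited if the script is run from that drive.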
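To make the difference concrete, here is a minimal sketch, assuming Python 3. The helper names normalize_paths and list_files are hypothetical and not part of the original script; they only illustrate one way to guard against a bare string being passed where a list of paths is expected.

```python
import os
import tempfile


def normalize_paths(paths):
    """Hypothetical helper: wrap a single path in a list."""
    # A bare string (or bytes / path object) is one path, not a sequence of paths.
    if isinstance(paths, (str, bytes, os.PathLike)):
        return [paths]
    return list(paths)


def list_files(paths):
    """Hypothetical helper: collect the full path of every file below each folder."""
    found = []
    for path in normalize_paths(paths):
        for dirpath, dirnames, filenames in os.walk(path):
            for filename in filenames:
                found.append(os.path.join(dirpath, filename))
    return found


# Iterating over a string yields single characters; this is what the
# original call handed to os.walk(), one character at a time.
chars = [c for c in r"D:\Test"]
print(chars)  # ['D', ':', '\\', 'T', 'e', 's', 't']

# With the guard in place, a string and a one-element list behave the same.
with tempfile.TemporaryDirectory() as root:
    with open(os.path.join(root, "a.jpg"), "w"):
        pass
    print(list_files(root) == list_files([root]))  # True
```

Whether to accept both forms or to insist on a list is a design choice; the guard simply makes the accidental per-character walk impossible.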

Upvotes: 1
