Rishabh Dixit

Reputation: 25

Using Python code to remove duplicate files from a directory and its subdirectories

I am trying to iterate through a directory and its subdirectories to find and remove duplicate files, but the script fails with this error:

Traceback (most recent call last):
  File "./fileDupchknew.py", line 29, in <module>
    dup_fileremove(dirname)
  File "./fileDupchknew.py", line 26, in dup_fileremove
    os.remove(filepath)
OSError: [Errno 21] Is a directory: '/tmp/rishabh-test/new-test'

Script:

#!/usr/bin/python
import os
import hashlib
import sys


dirname = sys.argv[1]
os.chdir(dirname)

def dup_fileremove(dir):
    duplicate = set()
    os.chdir(dir)
    path=os.getcwd()
    print ("The dir is: ", path)
    for filename in os.listdir(dir):
        filehash = None
        filepath=os.path.join(dir, filename)
        print("Current file path is: ", filepath)
        if os.path.isdir(filepath):
            dup_fileremove(filepath)
        elif os.path.isfile(filepath):
            filehash =hashlib.md5(file(filepath).read()).hexdigest()
        if filehash not in duplicate:
            duplicate.add(filehash)
        else:
            os.remove(filepath)
            print("removed : ", filepath)

dup_fileremove(dirname)

Upvotes: 2

Views: 5174

Answers (2)

Anand S Kumar

Reputation: 90889

Since you do not want to delete directories (as you said in the comments on the question):

No i don't want to delete directories

In that case, the error occurs because no filehash is computed for directories, so for a directory filehash stays None. When the loop reaches the first directory, None is not yet in the duplicate set, so it gets added. For every later directory, None is already in the set, so the else branch calls os.remove() on the directory, which raises the error.
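The mechanism can be reproduced without touching the filesystem. The following is only a minimal sketch: the function name simulate, the (name, is_directory) tuples and the placeholder hash strings are made up for illustration and are not part of the original script.

```python
# Hypothetical stand-in for the loop in the question. Each entry is a
# (name, is_directory) pair; directories never get a hash.
def simulate(entries):
    seen = set()
    would_remove = []
    for name, is_directory in entries:
        filehash = None
        if not is_directory:
            filehash = "hash-of-" + name  # placeholder, not a real MD5
        if filehash not in seen:
            seen.add(filehash)  # the first directory adds None here
        else:
            would_remove.append(name)  # later directories end up here
    return would_remove


print(simulate([("dir-a", True), ("file-1", False), ("dir-b", True)]))
# → ['dir-b']
```

The second directory is flagged for removal even though it is not a duplicate of anything, which matches the traceback in the question.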

A simple fix is to check that filehash is not None, both before adding it to the set and before trying to remove the file. Example:

#!/usr/bin/python
import os 
import hashlib
import sys


dirname = sys.argv[1] 
os.chdir(dirname)

def dup_fileremove(dir):
    duplicate = set()
    os.chdir(dir)
    path=os.getcwd()
    print ("The dir is: ", path)
    for filename in os.listdir(dir):
        filehash = None
        filepath=os.path.join(dir, filename)
        print("Current file path is: ", filepath)
        if os.path.isdir(filepath):
            dup_fileremove(filepath)
        elif os.path.isfile(filepath):
            # open() in binary mode works in both Python 2 and 3; file() is Python 2 only
            filehash = hashlib.md5(open(filepath, 'rb').read()).hexdigest()
        if filehash is not None and filehash not in duplicate:
            duplicate.add(filehash)
        elif filehash is not None:
            os.remove(filepath)
            print("removed : ", filepath)

dup_fileremove(dirname)

Upvotes: 1

PM 2Ring

Reputation: 55469

You're actually lucky you got that error message, otherwise your code would have deleted directories!

The problem is that after control returns from the recursive call to

dup_fileremove(filepath)

it then continues on to

if filehash not in duplicate:

You don't want that!

A simple way to fix it is to put a continue statement after dup_fileremove(filepath).
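As a rough sketch of the continue approach, assuming Python 3: this version differs from the question's script in a few ways that are my own choices, not the original author's. It drops the os.chdir() calls, reads files with open() in binary mode, sorts the directory listing so the result is predictable, and returns the list of removed paths instead of printing them.

```python
import hashlib
import os


def dup_fileremove(dirname):
    """Remove files in dirname whose content matches an earlier file there."""
    duplicate = set()  # as in the question, one set per directory
    removed = []
    for filename in sorted(os.listdir(dirname)):
        filepath = os.path.join(dirname, filename)
        if os.path.isdir(filepath):
            removed.extend(dup_fileremove(filepath))
            continue  # skip the hash check below for directories
        if not os.path.isfile(filepath):
            continue  # ignore anything that is neither file nor directory
        with open(filepath, "rb") as f:
            filehash = hashlib.md5(f.read()).hexdigest()
        if filehash not in duplicate:
            duplicate.add(filehash)
        else:
            os.remove(filepath)
            removed.append(filepath)
    return removed
```

Because the set is created inside the function, as in the question, duplicates are only detected within a single directory, not across subdirectories.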

But a much better fix is to indent the if filehash not in duplicate: block, together with its else branch, so that it sits inside the elif branch, aligned with the line that computes filehash. That way the duplicate check only runs for regular files.

For example:

#!/usr/bin/python
import os 
import hashlib
import sys

def dup_fileremove(dirname):
    duplicate = set()
    os.chdir(dirname)
    path=os.getcwd()
    print ("The dirname is: ", path)
    for filename in os.listdir(dirname):
        filehash = None
        filepath=os.path.join(dirname, filename)
        print("Current file path is: ", filepath)
        if os.path.isdir(filepath):
            dup_fileremove(filepath)
        elif os.path.isfile(filepath):
            # open() in binary mode works in both Python 2 and 3; file() is Python 2 only
            filehash = hashlib.md5(open(filepath, 'rb').read()).hexdigest()
            if filehash not in duplicate:
                duplicate.add(filehash)
            else:
                os.remove(filepath)
                print("removed : ", filepath)

dirname = sys.argv[1] 
os.chdir(dirname)

dup_fileremove(dirname)

I haven't tested this modified version of your code. It looks ok, but I make no guarantees. :)

BTW, it is recommended to not use the file() class directly to open files. In Python 3, file() no longer exists, and even in Python 2 the docs have recommended the use of the open() function since at least Python 2.5, if not earlier.
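For illustration, here is one way the hashing step could look with open(). This is a sketch, not a drop-in replacement: the helper name md5_of_file and the chunk_size parameter are hypothetical. It reads the file in binary mode and in chunks, so large files do not have to fit in memory.

```python
import hashlib


def md5_of_file(filepath, chunk_size=65536):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(filepath, "rb") as f:
        # iter() with a sentinel keeps calling f.read() until it returns b""
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The with statement also closes the file as soon as hashing is done, rather than leaving that to the garbage collector.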

Upvotes: 1
