UserYmY

Reputation: 8554

How to compare all of the files in a directory with each other two by two in Python?

I have a directory and I want to compare all of the files in it and get the percentage match between them. As a starting point, I decided to open one file and compare the other files with it:

filelist=[]
diff_list=[]
f= open("D:/Desktop/sample/ff69.txt")
flines= f.readlines()
path="D:/Desktop/sample"
for root, dirnames, filenames in os.walk(path):  
    for filename in fnmatch.filter(filenames, '*.txt'):   
        filelist.append(os.path.join(root, filename))


for m in filelist:
    g = open(m,'r')
    glines= g.readlines()



    d = difflib.Differ()
    #print d
    diffl= diff_list.append(d.compare(flines, glines))


print("".join(diff))#n_adds, n_subs, n_eqs, n_wiered = 0, 0, 0, 0
#

But my code does not work: when I print the result I get "None". Does anyone have any idea why? Or is there a better way to compare all of the files in a directory two by two?

Upvotes: 0

Views: 857

Answers (1)

g.d.d.c

Reputation: 47988

If you're attempting to compare files pairwise you probably want something closer to this:

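import difflib
import os

root = 'root'  # directory containing the files to compare
files = os.listdir(root)
for idx, filename in enumerate(files):
  try:
    fcompare = files[idx + 1]
  except IndexError:
    # We've reached the last file.
    break
  # Actual diffing code. os.listdir returns bare names, so join them with root.
  d = difflib.Differ()
  with open(os.path.join(root, filename)) as f1:
    lines1 = f1.readlines()
  with open(os.path.join(root, fcompare)) as f2:
    lines2 = f2.readlines()
  # compare() returns a generator; materialize it to keep the result.
  diff = list(d.compare(lines1, lines2))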

That will compare files 1-2, 2-3, 3-4, etc. It may be worth optimizing when you read the files: file 2 is used in loop iterations 1 and 2, so ideally its contents shouldn't be read twice. That may be premature optimization, though, depending on the number of files.
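
If you need every pair rather than only neighbouring files, and a percentage rather than a line-by-line diff, something along these lines might work. This is only a sketch under a few assumptions: the function names read_lines and pairwise_similarity are made up for illustration, files are selected by a ".txt" suffix in a single directory (no recursion), and "percentage match" is taken to mean difflib.SequenceMatcher.ratio() on the lists of lines, multiplied by 100. Each file is read once up front, which also avoids reading the same contents twice.

```python
import difflib
import itertools
import os


def read_lines(directory, suffix=".txt"):
    """Read every matching file once; return {filename: list of lines}."""
    contents = {}
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if name.endswith(suffix) and os.path.isfile(path):
            with open(path) as handle:
                contents[name] = handle.readlines()
    return contents


def pairwise_similarity(directory, suffix=".txt"):
    """Return {(name_a, name_b): percent} for every unordered pair of files."""
    contents = read_lines(directory, suffix)
    results = {}
    # combinations() yields each unordered pair exactly once.
    for name_a, name_b in itertools.combinations(contents, 2):
        matcher = difflib.SequenceMatcher(None, contents[name_a], contents[name_b])
        # ratio() is a similarity score between 0.0 and 1.0.
        results[(name_a, name_b)] = matcher.ratio() * 100
    return results
```

Calling pairwise_similarity("D:/Desktop/sample") would then give one score per pair of .txt files in that directory. The score is line-based, so two files that differ by one character on every line count as completely different; comparing the full file text instead of the list of lines would be finer-grained but slower. As for the "None" in the question: list.append always returns None, so assigning its result to diffl can never hold the diff.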

Upvotes: 2
