Iterate over 2 files in each folder and compare them

Question

I compare two text files and write the results to a third file. I want the script I'm running to iterate over every folder in its CWD that contains two text files.

What I have so far:

import os
import glob

path = './'
for infile in glob.glob( os.path.join(path, '*.*') ):
    print('current file is: ' + infile)
    with open (f1+'.txt', 'r') as fin1, open(f2+'.txt', 'r') as fin2:

Would this be a good way to start the iteration process?

It's not the clearest code, but it gets the job done. However, I'm pretty sure I need to take the logic out of the read/write methods, and I'm not sure where to start.

What I'm basically trying to do is have a script iterate over all of the folders in its CWD, open each folder, compare the two text files inside, write a third text file to the same folder, then move on to the next.

Another method I have tried is as follows:

import os

rootDir = 'C:\\Python27\\test'
for dirName, subdirList, fileList in os.walk(rootDir):
    print('Found directory: %s' % dirName)
    for fname in fileList:
        print('\t%s' % fname)

And this outputs the following (to give you a better example of the file structure):

Found directory: C:\Python27\test
    test.py
Found directory: C:\Python27\test\asdd
    asd1.txt
    asd2.txt
Found directory: C:\Python27\test\chro
    ch1.txt
    ch2.txt
Found directory: C:\Python27\test\hway
    hw1.txt
    hw2.txt

Would it be wise to put the compare logic under the for fname in fileList loop? How do I make sure it compares the two text files inside the specific folder and not with other fnames in the fileList?

This is the full code that I am trying to add this functionality to. I apologize for the Frankenstein nature of it; I am still working on a refined version, but it does not work yet.

from collections import defaultdict
from operator import itemgetter
from itertools import groupby
from collections import deque
import os



class avs_auto:

    def load_and_compare(self, input_file1, input_file2, output_file1, output_file2, result_file):
        self.load(input_file1, input_file2, output_file1, output_file2)
        self.compare(output_file1, output_file2)
        self.final(result_file)

    def load(self, fileIn1, fileIn2, fileOut1, fileOut2):
        with open(fileIn1+'.txt') as fin1, open(fileIn2+'.txt') as fin2:
            frame_rects = defaultdict(list)
            for row in (map(str, line.split()) for line in fin1):
                id, frame, rect = row[0], row[2], [row[3],row[4],row[5],row[6]]
                frame_rects[frame].append(id)
                frame_rects[frame].append(rect)
            frame_rects2 = defaultdict(list)
            for row in (map(str, line.split()) for line in fin2):
                id, frame, rect = row[0], row[2], [row[3],row[4],row[5],row[6]]
                frame_rects2[frame].append(id)
                frame_rects2[frame].append(rect)

        with open(fileOut1+'.txt', 'w') as fout1, open(fileOut2+'.txt', 'w') as fout2:
            for frame, rects in sorted(frame_rects.iteritems()):
                fout1.write('{{{}:{}}}\n'.format(frame, rects))
            for frame, rects in sorted(frame_rects2.iteritems()):
                fout2.write('{{{}:{}}}\n'.format(frame, rects))


    def compare(self, fileOut1, fileOut2):
        with open(fileOut1+'.txt', 'r') as fin1:
            with open(fileOut2+'.txt', 'r') as fin2:
                lines1 = fin1.readlines()
                lines2 = fin2.readlines()
                diff_lines = [l.strip() for l in lines1 if l not in lines2]
                diffs = defaultdict(list)
                with open(fileOut1+'x'+fileOut2+'.txt', 'w') as result_file:
                    for line in diff_lines:
                        d = eval(line)
                        for k in d:
                            list_ids = d[k]
                            for i in range(0, len(d[k]), 2):
                                diffs[d[k][i]].append(k)
                    for id_ in diffs:
                        diffs[id_].sort()
                        for k, g in groupby(enumerate(diffs[id_]), lambda (i, x): i - x):
                            group = map(itemgetter(1), g)
                            result_file.write('{0} {1} {2}\n'.format(id_, group[0], group[-1]))


    def final(self, result_file):
        with open(result_file+'.txt', 'r') as fin:
            lines = (line.split() for line in fin)
            for k, g in groupby(lines, itemgetter(0)):
                fst = next(g)
                lst = next(iter(deque(g, 1)), fst)
                with open('final/{}.avs'.format(k), 'w') as fout:
                    fout.write('video0=ImageSource("old\\%06d.jpeg", {}-3, {}+3, 15)\n'.format(fst[1], lst[2]))
                    fout.write('video1=ImageSource("new\\%06d.jpeg", {}-3, {}+3, 15)\n'.format(fst[1], lst[2]))
                    fout.write('video0=BilinearResize(video0,640,480)\n')
                    fout.write('video1=BilinearResize(video1,640,480)\n')
                    fout.write('StackHorizontal(video0,video1)\n')
                    fout.write('Subtitle("ID: {}", font="arial", size=30, align=8)'.format(k))

Using the load_and_compare() function, I define two input text files, two output text files, a file for the comparison results, and a final phase that writes many files for all of the differences.

What I am trying to do is have this whole class run on the current working directory and go through every subfolder, compare the two text files, and write everything into the same folder, specifically the final() results.

blubberdiblub · Accepted Answer

You can indeed use os.walk(), since that already separates the directories from the files. You only need the directories it returns, because that's where you're looking for your 2 specific files.

You could also use os.listdir(), but that returns directories as well as files in the same list, so you would have to check for directories yourself.

Either way, once you have the directories, you iterate over them (for subdir in dirnames) and join the various path components you have: The dirpath, the subdir name that you got from iterating over the list and your filename.

Assuming there are also some directories that don't have the specific 2 files, it's a good idea to wrap the open() calls in a try..except block and thus ignore the directories where one of the files (or both of them) doesn't exist.
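For comparison, here is a minimal sketch of the os.listdir() route. The function name subdirs_with_both_files and its parameters are made up for illustration. It also swaps the try..except described above for an os.path.isfile() check, which is shorter to show but leaves a small window between the check and a later open(). It should run unchanged on Python 2.7 and Python 3.

```python
import os


def subdirs_with_both_files(root, name1, name2):
    """Return (path1, path2) for each direct subdirectory of root holding both files."""
    pairs = []
    for entry in sorted(os.listdir(root)):
        subdir = os.path.join(root, entry)
        # os.listdir() mixes files and directories, so filter explicitly
        if not os.path.isdir(subdir):
            continue
        path1 = os.path.join(subdir, name1)
        path2 = os.path.join(subdir, name2)
        # skip directories where one or both files are missing
        if os.path.isfile(path1) and os.path.isfile(path2):
            pairs.append((path1, path2))
    return pairs
```

Calling subdirs_with_both_files(".", "test1.txt", "test2.txt") would then give the pairs to open, one level deep only.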

Finally, if you used os.walk(), you can easily choose if you only want to go into directories one level deep or walk the whole depth of the tree. In the former case, you just clear the dirnames list by dirnames[:] = []. Note that dirnames = [] wouldn't work, since that would just create a new empty list and put that reference into the variable instead of clearing the old list.

Replace the print("do something ...") with your program logic.

#!/usr/bin/env python

import errno
import os

f1 = "test1"
f2 = "test2"

path = "."
for dirpath, dirnames, _ in os.walk(path):
    for subdir in dirnames:
        # the parenthesized tuple works on both Python 2 and Python 3
        filepath1, filepath2 = [os.path.join(dirpath, subdir, f + ".txt") for f in (f1, f2)]
        try:
            with open(filepath1, 'r') as fin1, open(filepath2, 'r') as fin2:
                print("do something with " + str(fin1) + " and " + str(fin2))
        except IOError as e:
            # ignore directories that don't contain the 2 files
            if e.errno != errno.ENOENT:
                # reraise exception if different from "file or directory doesn't exist"
                raise

    # comment the next line out if you want to traverse all subsubdirectories
    dirnames[:] = []
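The difference between dirnames[:] = [] and dirnames = [] is easy to see outside of os.walk(). The sketch below uses two hypothetical helper functions and the folder names from the question as sample data. With the default top-down traversal, os.walk() keeps looking at the same list object it handed out, which is why only the in-place version stops the descent.

```python
def clear_in_place(names):
    # Slice assignment empties the very list object the caller holds.
    names[:] = []


def rebind_only(names):
    # Rebinding points the local name at a new list; the caller's list is untouched.
    names = []


subdirs = ["asdd", "chro", "hway"]

rebind_only(subdirs)
after_rebind = list(subdirs)   # still holds all three names

clear_in_place(subdirs)
after_clear = list(subdirs)    # now empty
```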

Edit:

Based on your comments, I hope I understand your question better now.

Try the following code snippet instead. The overall structure stays the same, only now I'm using the filenames returned by os.walk(). Unfortunately, that also makes it harder to do something like "go only into the subdirectories 1 level deep", so I hope walking the tree recursively is fine with you. If not, I'll have to add a little code later.

#!/usr/bin/env python

import fnmatch
import os

filter_pattern = "*.txt"

path = "."
for dirpath, dirnames, filenames in os.walk(path):
    # comment this out if you don't want to filter
    filenames = [fn for fn in filenames if fnmatch.fnmatch(fn, filter_pattern)]

    if len(filenames) == 2:
        # comment this out if you don't want the 2 filenames to be sorted
        filenames.sort(key=str.lower)

        filepath1, filepath2 = [os.path.join(dirpath, fn) for fn in filenames]
        with open(filepath1, 'r') as fin1, open(filepath2, 'r') as fin2:
            print("do something with " + str(fin1) + " and " + str(fin2))

I'm still not really sure what your program logic does, so you will have to interface the two yourself.
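If recursing through the whole tree is not wanted, one possible way to limit the depth with this filename-based approach is sketched below. This is not part of the snippet above: the function name iter_txt_pairs and the max_depth parameter are assumptions for illustration. Depth is derived from the path relative to the starting directory, and pruning again relies on clearing dirnames in place.

```python
import fnmatch
import os


def iter_txt_pairs(root, max_depth=1, pattern="*.txt"):
    """Yield (path1, path2) for each directory holding exactly two matching files.

    max_depth is a hypothetical parameter: 0 means only root itself,
    1 means root plus its direct subdirectories, and so on.
    """
    for dirpath, dirnames, filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        depth = 0 if rel == os.curdir else rel.count(os.sep) + 1
        if depth >= max_depth:
            # prune in place so os.walk() does not descend any further
            dirnames[:] = []

        matches = sorted(
            (fn for fn in filenames if fnmatch.fnmatch(fn, pattern)),
            key=str.lower,
        )
        if len(matches) == 2:
            yield tuple(os.path.join(dirpath, fn) for fn in matches)
```

With the folder layout from the question, iterating over iter_txt_pairs(".") would visit asdd, chro and hway but nothing below them. Note that the starting directory itself is also checked, so it would be yielded too if it happened to contain exactly two .txt files.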

However, I noticed that you're adding the ".txt" extension to the file name explicitly all over your code, so depending on how you are going to use the snippet, you might or might not need to remove the ".txt" extension first before handing the filenames over. That would be achieved by inserting the following line after or before the sort:

        filenames = [os.path.splitext(fn)[0] for fn in filenames]
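As a quick illustration of what that line does, here is a small sketch using the file names from the question as sample data. os.path.splitext() returns a (base, extension) pair and only splits off the last extension, so it also works on a full path, which matters because the methods in the question append ".txt" to whatever they are given.

```python
import os

filenames = ["asd1.txt", "asd2.txt"]

# splitext() returns (base, extension); [0] keeps the base name
basenames = [os.path.splitext(fn)[0] for fn in filenames]
print(basenames)

# It works the same way on a joined path: only the trailing ".txt" is removed,
# so appending ".txt" again gives back the original path.
full_path = os.path.join("asdd", "asd1.txt")
base_path = os.path.splitext(full_path)[0]
roundtrip_ok = (base_path + ".txt" == full_path)
```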

Also, I still don't understand why you're using eval(). Do the text files contain Python code? In any case, eval() should be avoided and replaced by code that's more specific to the task at hand.

If it's a list of comma separated strings, use line.split(",") instead.

If there might be whitespace before or after the comma, use [word.strip() for word in line.split(",")] instead.

If it's a list of comma separated integers, use [int(num) for num in line.split(",")] instead - for floats it works analogously.

etc.
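The alternatives listed above can be sketched as small helpers; the function names are made up for illustration. The last helper goes beyond the list: ast.literal_eval() from the standard library accepts only Python literals (dicts, lists, strings, numbers and so on) and rejects everything else, so it may be a drop-in replacement if the lines really look like the '{frame:[...]}' text written by load(). That assumes the frame values are plain integers.

```python
import ast


def parse_strings(line):
    # comma-separated strings, tolerating whitespace around each item
    return [word.strip() for word in line.split(",")]


def parse_ints(line):
    # comma-separated integers; int() itself ignores surrounding whitespace
    return [int(num) for num in line.split(",")]


def parse_literal(line):
    # Only literals are accepted; a function call or any other expression
    # raises ValueError (or SyntaxError if the text is not valid syntax).
    return ast.literal_eval(line.strip())
```

For example, parse_literal("{12:['7', ['1', '2', '3', '4']]}") gives a dict keyed by the frame number 12, while a line containing a function call is refused instead of being executed.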
