FrankN

Reputation: 33

Python for loop to go through directory files

I have files in two different directories that contain pickled lists of indexed text, as shown below, saved in the .out format:

(lp0 S'TCCTCTTGGAGCACCAGCTAATATTTCATCAGTATTCGCTGAATCTTCGGACATAGTTCA' p1 aS'TTCGGACATAGTTCATTCATATTTATTTGCCCAATACCCGCACGAAGAAGCCTTGCAGAC' p2 aS'AGAAGCCTTGCAGACACCGTGGCA' p3 a.

The task I am trying to accomplish is to open one file from the suspect text directory, compare it to each file in the source text directory using Python's difflib, and print out a number indicating whether or not they match. I then want to do the same with the rest of the files in the suspect text directory. (Side note: if anyone knows of a more detailed way to compare the two lists of indexed text, I am all ears, but it's far from a priority.)

My current problem is that the for loop I wrote for this task doesn't work. I can cycle through the directories and the file names print out okay, but the contents of the files themselves don't change. The loop is currently only comparing one file from each directory multiple times and I don't know how to fix it.

Any and all suggestions are welcome. Please feel free to ask any questions if my explanation has not been clear enough.

Thanks. Also, I know this is a common problem and I have tried my best to look at previous answers and apply what they used, but I am struggling to do so as I am not very good at programming.

Thanks in advance!

F

Code is below:

import string
import pickle
import sys
import glob
import difflib


sourcePath = 'C:\Users\User\Sou2/*.out'
suspectPath = 'C:\Users\User\Susp2/*.out'
list_of_source_files = glob.glob(sourcePath)
list_of_suspect_files = glob.glob(suspectPath)


def get_source_files(list_of_source_files):

    for source_file_name in list_of_source_files:
        with open(source_file_name) as source_file:
            sourceText = pickle.load(source_file)
        return sourceText


def get_suspect_files(list_of_suspect_files):

    for suspect_file_name in list_of_suspect_files:
        with open(suspect_file_name) as suspect_file:
            suspectText = pickle.load(suspect_file)
        return suspectText


def matching(sourceText,suspectText):

            matching = difflib.SequenceMatcher(None,sourceText,suspectText)
            print matching.ratio()


def main():

    for suspectItem in list_of_suspect_files:
        suspectText = get_suspect_files(list_of_suspect_files)
        print ('----------------SEPERATOR-----------------')
        for sourceItem in list_of_source_files:
            sourceText = get_source_files(list_of_source_files)
            matching(sourceText,suspectText)


main()

Current result:

----------------SEPERATOR-----------------
0.0
0.0
0.0
----------------SEPERATOR-----------------
0.0
0.0
0.0
----------------SEPERATOR-----------------
0.0
0.0
0.0
----------------SEPERATOR-----------------
0.0
0.0
0.0

This should be 1.0 for some of them, as I intentionally included matching indexed text to test the system.

Upvotes: 2

Views: 838

Answers (1)

Stuart

Reputation: 9868

Your functions get_source_files and get_suspect_files each contain loops, but are returning on the first iteration of the loop. So that's why your programme only looks at the first file in each list.

Moreover, the loops in those two functions are duplicated by the loops in the main function. In your main function you never use the loop variables suspectItem and sourceItem, so those loops merely do the same thing several times.

Possibly, you are confusing yield and return, and somehow expecting your functions to behave like generators.
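To illustrate the difference, here is a minimal sketch. The function names and the file names in the list are made up for the example and are not from your code.

```python
def first_only(items):
    # 'return' exits the function during the first iteration,
    # so only the first item is ever produced.
    for item in items:
        return item


def each_item(items):
    # 'yield' makes this a generator: it hands back one item per
    # iteration and resumes where it left off on the next request.
    for item in items:
        yield item


names = ['a.out', 'b.out', 'c.out']  # hypothetical file names

print(first_only(names))       # a.out
print(list(each_item(names)))  # ['a.out', 'b.out', 'c.out']
```

Your get_source_files and get_suspect_files behave like first_only here, which is why the same file comes back on every call.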

Something like this should work:

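def get_text(file_name):
    # Load and return the pickled list stored in a single file
    with open(file_name) as f:
        return pickle.load(f)

def matching(sourceText, suspectText):
    # ratio() is 1.0 for identical sequences and 0.0 when nothing matches
    matcher = difflib.SequenceMatcher(None, sourceText, suspectText)
    print(matcher.ratio())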

def main():
    for suspect_file in list_of_suspect_files:
        print ('----------------SEPERATOR-----------------')
        suspect_text = get_text(suspect_file)
        for source_file in list_of_source_files:
            source_text = get_text(source_file)
            matching(source_text, suspect_text)

main()

Note that this repeats the loading of the source texts in each iteration. If this is slow, and the texts are not too long to fit in memory, you could store all of the source and suspect texts in lists instead.
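A rough sketch of that idea follows. It assumes Python 3 and that the files can be read in binary mode; load_all and compare_all are names I have made up for the example, and the results are collected in a list rather than printed so that they can be inspected afterwards.

```python
import difflib
import glob
import pickle


def get_text(file_name):
    # Assumption: Python 3, where pickle files must be opened in binary mode
    with open(file_name, 'rb') as f:
        return pickle.load(f)


def load_all(pattern):
    # Read every matching file once and keep (file name, text) pairs in a list
    return [(name, get_text(name)) for name in sorted(glob.glob(pattern))]


def compare_all(source_texts, suspect_texts):
    # Compare each suspect text against each source text that is already
    # in memory, returning (suspect name, source name, ratio) tuples
    results = []
    for suspect_name, suspect_text in suspect_texts:
        for source_name, source_text in source_texts:
            matcher = difflib.SequenceMatcher(None, source_text, suspect_text)
            results.append((suspect_name, source_name, matcher.ratio()))
    return results
```

With the paths from the question, this could be called as compare_all(load_all(sourcePath), load_all(suspectPath)). Each file is then unpickled exactly once, at the cost of holding all the texts in memory at the same time.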

Upvotes: 2
