Reputation: 183
I'm trying to match network logon usernames across two files. All.txt is a text file of names I am (or will be) interested in matching. Currently, I'm doing something like this:
import fnmatch
import os

def find_files(directory, pattern):
    #directory = raw_input("Enter a directory to search for Userlists: ")
    directory = "c:\\TEST"
    os.chdir(directory)
    for root, dirs, files in os.walk(directory):
        for basename in files:
            if fnmatch.fnmatch(basename, pattern):
                filename = os.path.join(root, basename)
                yield filename

for filename in find_files('a-zA-Z0-9', '*.txt'):
    with open(filename, "r") as file1:
        with open("c:/All.txt", "r") as file2:
            list1 = file1.readlines()[18:]
            list2 = file2.readlines()
            for i in list1:
                for j in list2:
                    if i == j:
                        print(i)  # matched name
I'm new to Python and am wondering if this is the best and most efficient way of doing this. Even to me as a newbie it seems a little clunky, but with my current coding knowledge it is the best I can come up with at the moment. Any help and advice would be gratefully received.
Upvotes: 1
Views: 1455
Reputation: 1125018
You want to read one file into memory first, storing it in a set. Membership testing in a set is very efficient, much more so than looping over the lines of the second file for every line in the first file.
Then you only need to read the second file, and line by line process it and test if lines match.
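To see why the set matters, here is a minimal sketch with made-up usernames: a `set` lookup takes roughly constant time on average, no matter how many names are stored, while scanning a list of lines is linear in its length.

```python
# hypothetical names for illustration; a set gives average O(1)
# membership tests, versus O(n) for scanning a list line by line
all_names = {"jsmith", "akhan", "mwilliams"}

print("akhan" in all_names)    # True
print("unknown" in all_names)  # False
```

The same `in` operator works on a list, but against a set it uses hashing instead of a linear scan, which is what makes the approach below scale.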
Which file you keep in memory depends on the size of All.txt. If it is < 1000 lines or so, just keep it in memory and compare it to the other files. If All.txt is really large, re-open it for every file1 you process, read only the first 18 lines of file1 into memory, and match those against every line in All.txt, line by line.
To read just 18 lines of a file, use itertools.islice(); files are iterables, and islice() is the easiest way to pick a subset of lines to read.
Reading All.txt into memory first:
from itertools import islice

with open("c:/All.txt", "r") as all:
    # storing lines without whitespace to make matching a little more robust
    all_lines = set(line.strip() for line in all)

for filename in find_files('a-zA-Z0-9', '*.txt'):
    with open(filename, "r") as file1:
        for line in islice(file1, 18):
            if line.strip() in all_lines:
                print(line.strip())  # matched line
If All.txt is large, store those 18 lines of each file in a set first, then re-open All.txt and process it line by line:
for filename in find_files('a-zA-Z0-9', '*.txt'):
    with open(filename, "r") as file1:
        file1_lines = set(line.strip() for line in islice(file1, 18))
    with open("c:/All.txt", "r") as all:
        for line in all:
            if line.strip() in file1_lines:
                print(line.strip())  # matched line
Note that you do not have to change directories in find_files(); os.walk() is already passed the directory name. The fnmatch module also has a filter() function; use that to filter files in one call instead of using fnmatch.fnmatch() on each file individually:
def find_files(directory, pattern):
    for root, dirs, files in os.walk(directory):
        for basename in fnmatch.filter(files, pattern):
            yield os.path.join(root, basename)
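A quick self-contained way to try the cleaned-up generator (using a throwaway temporary directory instead of c:\TEST, so the sketch runs anywhere):

```python
import fnmatch
import os
import tempfile

def find_files(directory, pattern):
    # walk the tree once, filtering each directory's file list in one call
    for root, dirs, files in os.walk(directory):
        for basename in fnmatch.filter(files, pattern):
            yield os.path.join(root, basename)

# demo: create one matching and one non-matching file
tmp = tempfile.mkdtemp()
open(os.path.join(tmp, "users.txt"), "w").close()
open(os.path.join(tmp, "notes.log"), "w").close()

matches = list(find_files(tmp, "*.txt"))
print([os.path.basename(m) for m in matches])  # ['users.txt']
```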
Upvotes: 4