Reputation: 183
I'm trying to match network logon usernames across two files. All.txt is a text file of names I am (or will be) interested in matching. Currently, I'm doing something like this:
import fnmatch
import os

def find_files(directory, pattern):
    #directory = raw_input("Enter a directory to search for Userlists: ")
    directory = "c:\\TEST"
    os.chdir(directory)
    for root, dirs, files in os.walk(directory):
        for basename in files:
            if fnmatch.fnmatch(basename, pattern):
                filename = os.path.join(root, basename)
                yield filename

for filename in find_files('a-zA-Z0-9', '*.txt'):
    with open(filename, "r") as file1:
        with open("c:/All.txt", "r") as file2:
            list1 = file1.readlines()[18:]
            list2 = file2.readlines()
            for i in list1:
                for j in list2:
                    if i == j:
                        print(i)  # matched name
I'm new to Python and am wondering if this is the best and most efficient way of doing this. Even to me as a newbie it seems a little clunky, but with my current coding knowledge it is the best I can come up with at the moment. Any help and advice would be gratefully received.
Upvotes: 1
Views: 1455
Reputation: 1125018
You want to read one file into memory first, storing it in a set. Membership testing in a set is very efficient, much more so than looping over the lines of the second file for every line in the first file.
Then you only need to read the second file, and line by line process it and test if lines match.
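To see why the set matters, here is a minimal sketch with made-up usernames: a `set` lookup takes roughly constant time on average, no matter how many names are stored, while scanning a list of lines is linear in its length.

```python
# hypothetical names for illustration; a set gives average O(1)
# membership tests, versus O(n) for scanning a list line by line
all_names = {"jsmith", "akhan", "mwilliams"}

print("akhan" in all_names)    # True
print("unknown" in all_names)  # False
```

The same `in` operator works on a list, but against a set it uses hashing instead of a linear scan, which is what makes the approach below scale.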
Which file you keep in memory depends on the size of All.txt. If it is < 1000 lines or so, just keep it in memory and compare it to the other files. If All.txt is really large, re-open it for every file1 you process, read only the first 18 lines of file1 into memory, and match those against every line in All.txt, line by line.
To read just 18 lines of a file, use itertools.islice(); files are iterables, and islice() is the easiest way to pick a subset of lines to read.
Reading All.txt into memory first:
from itertools import islice

with open("c:/All.txt", "r") as all:
    # storing lines without whitespace to make matching a little more robust
    all_lines = set(line.strip() for line in all)

for filename in find_files('a-zA-Z0-9', '*.txt'):
    with open(filename, "r") as file1:
        for line in islice(file1, 18):
            if line.strip() in all_lines:
                print(line.strip())  # matched line
If All.txt is large, store those 18 lines of each file in a set first, then re-open All.txt and process it line by line:
for filename in find_files('a-zA-Z0-9', '*.txt'):
    with open(filename, "r") as file1:
        file1_lines = set(line.strip() for line in islice(file1, 18))
    with open("c:/All.txt", "r") as all:
        for line in all:
            if line.strip() in file1_lines:
                print(line.strip())  # matched line
Note that you do not have to change directories in find_files(); os.walk() is already passed the directory name. The fnmatch module also has a filter() function; use that to filter files in one call instead of using fnmatch.fnmatch() on each file individually:
def find_files(directory, pattern):
    for root, dirs, files in os.walk(directory):
        for basename in fnmatch.filter(files, pattern):
            yield os.path.join(root, basename)
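A quick self-contained way to try the cleaned-up generator (using a throwaway temporary directory instead of c:\TEST, so the sketch runs anywhere):

```python
import fnmatch
import os
import tempfile

def find_files(directory, pattern):
    # walk the tree once, filtering each directory's file list in one call
    for root, dirs, files in os.walk(directory):
        for basename in fnmatch.filter(files, pattern):
            yield os.path.join(root, basename)

# demo: create one matching and one non-matching file
tmp = tempfile.mkdtemp()
open(os.path.join(tmp, "users.txt"), "w").close()
open(os.path.join(tmp, "notes.log"), "w").close()

matches = list(find_files(tmp, "*.txt"))
print([os.path.basename(m) for m in matches])  # ['users.txt']
```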
Upvotes: 4