Reputation: 47
So I have two .txt files that I'm trying to match up. The first .txt file contains about 12,500 lines, one name per line:
John Smith
Jane Smith
Joe Smith
The second .txt file also contains lines with names (which might repeat), plus extra info, and is about 17GB in total:
584 19423 john smith John Smith 79946792 5 5 11 2016-06-24
584 19434 john smith John Smith 79923732 5 4 11 2018-03-14
584 19423 jane smith Jane Smith 79946792 5 5 11 2016-06-24
My goal is to find all the names from File 1 in File 2, and then spit out the File 2 lines that contain any of those File 1 names.
Here is my Python code:
with open("Documents/File1.txt", "r") as t:
terms = [x.rstrip('\n') for x in t]
with open("Documents/File2.txt", "r") as f, open("Documents/matched.txt","w") as w:
for line in f:
if any([term in line for term in terms]):
w.write(line)
So this code definitely works, but it has been running for 3 days (and is still going!!!). Some back-of-the-envelope math worries me: at roughly 80 bytes per line, 17GB works out to on the order of 200 million lines, and each line gets substring-checked against all ~12,500 terms, i.e. trillions of scans. I'm worried that my algorithm is computationally intractable (or hyper-inefficient) given the size of the data.
Would anyone be able to provide feedback re: (1) whether this is actually intractable and/or extremely inefficient, and if so, (2) what an alternative algorithm might be?
Thank you!!
Upvotes: 1
Views: 83
Reputation: 13106
First, when testing membership, set and dict are going to be much, much faster, so terms should be a set:
with open("Documents/File1.txt", "r") as t:
terms = set(line.strip() for line in t)
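To get a feel for the difference, here is a quick, illustrative micro-benchmark (the names and counts are made up to mirror the size of File1; actual timings will vary by machine):

import timeit

# Hypothetical data: 12,500 fake names, roughly the size of File1
setup = "names_list = ['Name %d' % i for i in range(12500)]; names_set = set(names_list)"

# Worst case for the list: the name we look up is near the end,
# so Python scans almost the whole list
print(timeit.timeit("'Name 12000' in names_list", setup=setup, number=1000))
# The set lookup is a constant-time hash probe
print(timeit.timeit("'Name 12000' in names_set", setup=setup, number=1000))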
Next, I would split each line into a list and check whether the name is in the set, rather than checking whether each member of the set occurs in the line: a substring search is O(N) where N is the length of each line, and it has to be repeated for every one of the ~12,500 terms, whereas a set lookup is O(1) on average. Splitting the line also lets you pick out, via slicing, exactly the columns that contain the first and last name:
with open("Documents/File2.txt", "r") as f, open("Documents/matched.txt","w") as w:
for line in f:
# split the line on whitespace
names = line.split()
# Your names seem to occur here
name = ' '.join(names[4:6])
if name in terms:
w.write(line)
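As a side note, your sample lines also carry a lowercase copy of each name in columns 2-3. If capitalization in File2 is ever inconsistent, a slightly more defensive variant (an assumption on my part, not something your sample strictly requires) is to normalize both sides to lowercase:

with open("Documents/File1.txt", "r") as t:
    # lowercase the terms so the comparison is case-insensitive (assumption)
    terms = set(line.strip().lower() for line in t)

with open("Documents/File2.txt", "r") as f, open("Documents/matched.txt", "w") as w:
    for line in f:
        names = line.split()
        # lowercase the name columns before the set lookup
        if ' '.join(names[4:6]).lower() in terms:
            w.write(line)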
Upvotes: 2