Hello
Hello

Reputation: 47

Computational tractability of algorithm for matching names in two files in python

So I have two .txt files that I'm trying to match up. The first .txt file is just lines of about 12,500 names.

John Smith
Jane Smith
Joe Smith

The second .txt file also contains lines with names (that might repeat) but also extra info, about 17GB total.

584 19423   john smith  John Smith  79946792    5   5   11  2016-06-24
584 19434   john smith  John Smith  79923732    5   4   11  2018-03-14
584 19423   jane smith  Jane Smith  79946792    5   5   11  2016-06-24

My goal is to find all the names from File 1 in File 2, and then spit out the File 2 lines that contain any of those File 1 names.

Here is my python code:

with open("Documents/File1.txt", "r") as t:
    terms = [x.rstrip('\n') for x in t]
with open("Documents/File2.txt", "r") as f, open("Documents/matched.txt","w") as w: 
    for line in f:
        if any([term in line for term in terms]):
            w.write(line)

So this code definitely works, but it has been running for 3 days (and is still going!!!). I did some back-of-the-envelope calculations, and I'm very worried that my algorithm is computationally intractable (or hyper inefficient) given the size of the data.

Would anyone be able to provide feedback re: (1) whether this is actually intractable and/or extremely inefficient and if so (2) what an alternative algorithm might be?

Thank you!!

Upvotes: 1

Views: 83

Answers (1)

C.Nivs
C.Nivs

Reputation: 13106

First, when testing membership, set and dict are going to be much, much faster, so terms should be a set:

with open("Documents/File1.txt", "r") as t:
    terms = set(line.strip() for line in t)

Next, I would split each line into a list, and check if the name is in the set, not if members of the set are in the line, which is O(N) where N is the length of each line. This way you can directly pick out the column numbers (via slicing) that contain the first and last name:

with open("Documents/File2.txt", "r") as f, open("Documents/matched.txt","w") as w: 
    for line in f:
        # split the line on whitespace
        names = line.split()

        # Your names seem to occur here
        name = ' '.join(names[4:6])

        if name in terms:
            w.write(line)

Upvotes: 2

Related Questions