ian_informatics
ian_informatics

Reputation: 51

How can I make this search faster in python?

I'm searching for a value from one file in the lines of another. The exact value will only occur once in the search file. How can I make this process faster? Here is my current code:

filltaxlist = open("file with query number.txt", "rw")
fulltaxa = open("output file with hit line match", "rw")

for line in filltaxalist:
    line = line.strip()
    taxid = re.split("\t", line)
    lookup = taxid[5] # this value is a number and I need the exact match only so I convert it to an integer
    int1 = int(lookup)
    for line in open("File to search.txt", "r"):
        data = re.split(',', line)
        hit = int(data[0]) # every value in this file is a number separated by a ,
        if lookup in line:
            if int1 == hit:
                fulltaxa.write(line)

This works fine as it is written just very slow. Also the file that I am searching in is over a GB in size. The

Example of filltaxlist line:

cvvel_1234    403454663    29.43    3e-30    55.55555555234    1172189
cvell_1444    2342333      30.00    1e-50    34.34584359345    5911
cvell_1444    234230055    23.23    1e-60    32.23445983454    46245
cvell_1444    233493003    23.44    1e-43    35.23595604593    46245

What fulltaxa should return:

1172189, 5943, 1002030, 12345
5911, 11234, 112356, 234, 3456, 44568, 78356
46245, 123, 3432456, 123488976, 23564, 334
46245, 123, 3432456, 123488976, 23564, 334

Upvotes: 3

Views: 305

Answers (2)

cmh
cmh

Reputation: 10927

Use a database

As others have mentioned, the easiest approach is probably going to be dumping this into a db (e.g. sqllite). You can use python bindings if you need to interface with the language.

Pure Python Solution

You read fulltaxa entirely for each entry in filltaxlist (due to the order of the nesting), it will be more efficient to cache all your queries first, then read fulltaxa once only, then sort the output to regain the order of fulltaxa.

As the order of the queries is import, we should use a FIFO structure - a deque will do nicely in our case.

from collections import defaultdict
filltaxlist = open("file with query number.txt", "rw")
fulltaxa = open("output file with hit line match", "rw")

possibles = {}
for i, line in enumerate(filltaxalist):
    line = line.strip()
    taxid = re.split("\t", line)
    lookup = taxid[5] # this value is a number and I need the exact match only so I covert it to an integer
    int1 = int(lookup)
    possibles[int1] = i

output_lines = defaultdict(list)
for line in open("File to search.txt", "r"):
    data = re.split(',', line)
    hit = int(data[0]) # every value in this file is a number separated by a ,
    if hit in possibles:
        output_lines[possibles[hit]].append(line)

fulltaxa.writelines(line for lines in output_lines.values() for line in lines)

When you run out of queries the above code will throw an IndexError

Some other minor improvements.

data = re.split(',', line)

is probably slower than

data = line.split(',')

but you should profile to ensure that this is meaninfgul in your case.

Upvotes: 4

morningstar
morningstar

Reputation: 9132

Your algorithm is O(m * n). It's possible to make an O(m + n) algorithm instead by using a dictionary. Even if m is small, it's probably a significant improvement in Python, where the constant factor of dictionary access is not that much different from any other statement.

filltaxalist = open("file with query number.txt", "rw")
fulltaxa = open("output file with hit line match", "rw")

filltaxadict = {}
for i, line in enumerate(filltaxalist):
    line = line.strip()
    taxid = re.split("\t", line)
    lookup = taxid[5] # this value is a number and I need the exact match only so I convert it to an integer
    int1 = int(lookup)

    filltaxadict[int1] = i

results = [[]] * len(filltaxadict)
for line in open("File to search.txt", "r"):
    data = re.split(',', line)
    hit = int(data[0]) # every value in this file is a number separated by a ,
    match = filltaxadict.get(hit)
    if match is not None:
        results[match].append(line)

for result in results:
    fulltaxa.writelines(result)

This handles duplicates and in the right order; slightly simpler if you don't need to. The file to search can be big; this won't keep its contents in memory, only (a portion of) the contents of filltaxalist, which I assume is not unusually large.

Upvotes: 1

Related Questions