Reputation: 51
I'm searching for a value from one file in the lines of another. The exact value will only occur once in the search file. How can I make this process faster? Here is my current code:
import re

filltaxalist = open("file with query number.txt", "r")
fulltaxa = open("output file with hit line match", "w")

for line in filltaxalist:
    line = line.strip()
    taxid = re.split("\t", line)
    lookup = taxid[5]  # this value is a number and I need the exact match only so I convert it to an integer
    int1 = int(lookup)
    for line in open("File to search.txt", "r"):
        data = re.split(',', line)
        hit = int(data[0])  # every value in this file is a number separated by a ,
        if lookup in line:
            if int1 == hit:
                fulltaxa.write(line)
This works fine as it is written, just very slow. Also, the file that I am searching in is over a GB in size.
Example filltaxlist lines:
cvvel_1234 403454663 29.43 3e-30 55.55555555234 1172189
cvell_1444 2342333 30.00 1e-50 34.34584359345 5911
cvell_1444 234230055 23.23 1e-60 32.23445983454 46245
cvell_1444 233493003 23.44 1e-43 35.23595604593 46245
What fulltaxa should return:
1172189, 5943, 1002030, 12345
5911, 11234, 112356, 234, 3456, 44568, 78356
46245, 123, 3432456, 123488976, 23564, 334
46245, 123, 3432456, 123488976, 23564, 334
Upvotes: 3
Views: 305
Reputation: 10927
As others have mentioned, the easiest approach is probably going to be dumping this into a database (e.g. SQLite). You can use the Python bindings (the built-in sqlite3 module) if you need to interface with it from the language.
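For example, a minimal sketch of that route using the standard-library sqlite3 module (the database file name and the table/column names are my own invention; the parsing mirrors the code in the question):

import re
import sqlite3

conn = sqlite3.connect("taxa.db")  # hypothetical database file
conn.execute("CREATE TABLE IF NOT EXISTS hits (taxid INTEGER PRIMARY KEY, line TEXT)")

# Load the GB-sized search file once; the first comma-separated field is the key,
# which the question says is unique in that file.
with open("File to search.txt", "r") as search_file:
    conn.executemany(
        "INSERT OR REPLACE INTO hits VALUES (?, ?)",
        ((int(entry.split(",")[0]), entry) for entry in search_file),
    )
conn.commit()

# Each query then becomes an indexed lookup instead of a full file scan.
with open("file with query number.txt", "r") as queries, \
        open("output file with hit line match", "w") as fulltaxa:
    for line in queries:
        taxid = int(re.split("\t", line.strip())[5])
        row = conn.execute("SELECT line FROM hits WHERE taxid = ?", (taxid,)).fetchone()
        if row is not None:
            fulltaxa.write(row[0])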
You read the search file entirely for each entry in filltaxlist (due to the order of the nesting); it will be more efficient to cache all your queries first, read the search file once only, and then sort the output to regain the order of filltaxlist.
As the order of the queries is important, we record the position of each query and group the matching lines by that position - a defaultdict of lists will do nicely in our case.
import re
from collections import defaultdict

filltaxalist = open("file with query number.txt", "r")
fulltaxa = open("output file with hit line match", "w")

possibles = {}
for i, line in enumerate(filltaxalist):
    line = line.strip()
    taxid = re.split("\t", line)
    lookup = taxid[5]  # this value is a number and I need the exact match only so I convert it to an integer
    int1 = int(lookup)
    possibles[int1] = i

output_lines = defaultdict(list)
for line in open("File to search.txt", "r"):
    data = re.split(',', line)
    hit = int(data[0])  # every value in this file is a number separated by a ,
    if hit in possibles:
        output_lines[possibles[hit]].append(line)

fulltaxa.writelines(line for i in sorted(output_lines) for line in output_lines[i])
Note that if the same query value occurs more than once in filltaxlist, the code above keeps only its last position.
Some other minor improvements:
data = re.split(',', line)
is probably slower than
data = line.split(',')
but you should profile to ensure that the difference is meaningful in your case.
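For instance, a quick check with the standard-library timeit module (the sample line below is made up; real timings depend on your data):

import re
import timeit

line = "1172189, 5943, 1002030, 12345"
print(timeit.timeit(lambda: re.split(',', line)))  # regex-based split
print(timeit.timeit(lambda: line.split(',')))      # plain str.split, usually faster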
Upvotes: 4
Reputation: 9132
Your algorithm is O(m * n). It's possible to make an O(m + n) algorithm instead by using a dictionary. Even if m is small, it's probably a significant improvement in Python, where the constant factor of dictionary access is not that much different from any other statement.
import re

filltaxalist = open("file with query number.txt", "r")
fulltaxa = open("output file with hit line match", "w")

filltaxadict = {}
num_queries = 0
for i, line in enumerate(filltaxalist):
    line = line.strip()
    taxid = re.split("\t", line)
    lookup = taxid[5]  # this value is a number and I need the exact match only so I convert it to an integer
    int1 = int(lookup)
    filltaxadict.setdefault(int1, []).append(i)  # the same taxid can appear on several query lines
    num_queries = i + 1

results = [[] for _ in range(num_queries)]  # independent lists, one per query line
for line in open("File to search.txt", "r"):
    data = re.split(',', line)
    hit = int(data[0])  # every value in this file is a number separated by a ,
    for match in filltaxadict.get(hit, ()):
        results[match].append(line)

for result in results:
    fulltaxa.writelines(result)
This handles duplicate queries and writes the results in the right order; it can be slightly simpler if you don't need either. The file to search can be big; this won't keep its contents in memory, only (a portion of) the contents of filltaxalist, which I assume is not unusually large.
Upvotes: 1