Reputation: 11
I'm trying to compare two files. Each file has around 70k lines, and with my current algorithm it takes around 5 minutes to fully compare all of them.
Essentially what I'm doing is taking all the lines of both files and putting them into lists, so they look like this:
compare_list_new = [['Albert', 'V4', '25.000', '45.000', '1.3500'],
                    ['James', 'V4', '22.000', '43.000', '1.4000'],
                    ['James', 'V5', '23.000', '41.000', '1.3000']]
compare_list_old = [['Albert', 'V4', '25.000', '45.000', '1.3900'],
                    ['James', 'V4', '22.000', '43.000', '1.2000'],
                    ['James', 'V5', '23.000', '41.000', '1.2000']]
The idea is that both files contain similar names, so to find a new entry in the old file we have to search based on the coordinates; for example, to match a specific 'James' from the new file to the old file, I would have to use the '22.000' and '43.000'.
After finding the entry, I take the 1.4000 from the new file and the 1.2000 from the old file and subtract them to find the delta from old to new.
This is the current algorithm I'm using:
# This is not important
import time
import timeit
import bisect
from operator import itemgetter

compare = open("factor.output.new.txt", "w")
compare_list_new = []
compare_list_old = []
newlist = []

# File count algorithm
start = time.time()  # Tracks execution time

def list_create(fname):  # Makes the list in the appropriate format
    newlist = []
    with open(fname) as file:
        for i, line in enumerate(file):
            if i > 6:  # once past the header lines, consume the rest of the file
                for line in file:
                    lines_list = line.split(" ")
                    del lines_list[0]
                    del lines_list[2:29]
                    del lines_list[5:12]
                    newlist.append(lines_list)
    return newlist

# Creates lists and sorts them
compare_list_new = list_create("par_iop.pwr.sfactor.output_new.ipf")
compare_list_new = sorted(compare_list_new, key=itemgetter(2))
compare_list_old = list_create("par_iop.pwr.sfactor.output_old.ipf")
compare_list_old = sorted(compare_list_old, key=itemgetter(2))

compare.write("Name Version Coordinate_x Coordinate_y Sfactordelta FLAG\n")
compare_list_copy = compare_list_old  # Makes a "copy" (really just a second reference to the same list)
match = 0
for item in compare_list_new:  # compares both lists
    end = time.time()
    print(end - start)
    for line in compare_list_old:
        if item[0:4] == line[0:4]:
            s1 = float(item[4])
            s2 = float(line[4])
            delta = s1 - s2
            delta = format(delta, '.4f')
            item[4] = str(delta)
            text = " ".join(item)
            compare.write(text + " " + "\n")
            compare_list_copy.remove(line)
            match = 1
            break
    if match == 1:
        compare_list_old = compare_list_copy
        match = 0
    else:
        text = " ".join(item)
        compare.write(text + " " + "ITEM NOT FOUND IN OLD FILE BUT IS IN NEW FILE" + "\n")
        try:
            compare_list_copy.remove(line)
        except ValueError:
            pass
        compare_list_old = compare_list_copy
compare.close()
Essentially, the part that compares both lists works like this: after sorting, if an entry matches, it computes the delta, removes the matched line from the copy, and then sets the old list equal to the copy so that items are not removed while iterating over the list. If an item has no match, it means it is in the new file but not in the old file.
I want something that could potentially make this process much faster.
Upvotes: 1
Views: 290
Reputation: 1128
compare_list_new = [['Albert', 'V4', '25.000', '45.000', '1.3500'],
                    ['James', 'V4', '22.000', '43.000', '1.4000'],
                    ['James', 'V5', '23.000', '41.000', '1.3000']]
compare_list_old = [['Albert', 'V4', '25.000', '45.000', '1.3900'],
                    ['James', 'V4', '22.000', '43.000', '1.2000'],
                    ['James', 'V5', '23.000', '41.000', '1.2000']]
d = {}
for l in compare_list_old:
    # tuple of the identifying fields (name, version, x, y) as key, factor string as value
    d[tuple(l[0:4])] = l[4]

print(d)
# {('Albert', 'V4', '25.000', '45.000'): '1.3900', ('James', 'V4', '22.000', '43.000'): '1.2000', ('James', 'V5', '23.000', '41.000'): '1.2000'}

print(d[('Albert', 'V4', '25.000', '45.000')])
# 1.3900

for item in compare_list_new:
    old_float_val = d[tuple(item[0:4])]
    new_float_val = item[4]
    # continue whatever calculation here
The idea is to turn the old list into a dictionary, with the identifying fields as the key and the factor as the value. That way, we are not iterating over the old list once for every item of the new list; each lookup is a single constant-time hash lookup.
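To finish the sketch, the missing-entry case and the delta could be handled like this (dict.get and the NOT FOUND text are assumptions carried over from the question's output format):
for item in compare_list_new:
    old_val = d.get(tuple(item[0:4]))      # None when the entry exists only in the new file
    if old_val is None:
        print(" ".join(item), "ITEM NOT FOUND IN OLD FILE BUT IS IN NEW FILE")
    else:
        delta = format(float(item[4]) - float(old_val), '.4f')
        print(" ".join(item[0:4]), delta)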
Upvotes: 0
Reputation: 365597
There's a lot of code here, it's hard to tell what exactly the logic is supposed to be, and there's no indication of which part you think is slow (or how you know), but one thing immediately jumps out:
compare_list_copy.remove(line)
… and another remove later.
First, whenever you call lst.remove(val), the list has to do a linear search, comparing each of the elements to val. But you already know the index of the element you want (or, rather, you could know it by just using enumerate), so this whole search is wasted; just del lst[idx] instead.
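A minimal sketch of that, reusing the names from the question's inner loop (illustrative only):
for idx, line in enumerate(compare_list_old):
    if item[0:4] == line[0:4]:
        del compare_list_old[idx]  # delete by index; no second linear search like .remove(line)
        break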
Second, whether you remove or del, you're still deleting from the middle of an array. That means shifting all of the subsequent elements up one slot. It's got a much faster constant factor (it's just a big memmove, rather than a bunch of calls to a comparison function), but it's still linear.
And you're doing this inside your inner loop. So, you're multiplying an extra factor of N onto your already quadratic time. And any effort you put into doing the search in logarithmic rather than linear time via bisect is going to be wasted if you just follow that logarithmic search with a linear search over the same data.
If you need something you can search in logarithmic time, and also modify in logarithmic time, what you want is some kind of tree (or tree-list structure, like a skiplist). There are nice libraries wrapping all kinds of binary tree and b-tree variants on PyPI, or you can look up the algorithms on Wikipedia.
Or you can just grab something like the Sorted Containers library, which wraps things up at a higher level. For example, a sorteddict acts a lot like a dict, but you can search for the nearest key instead of an exact match, or for all keys within a given range, etc. Under the covers, it works with some kind of hybrid rope-of-btrees or something, but you don't need to care about those details; what matters is that it guarantees all of the operations you need in logarithmic time.
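To illustrate (a sketch only, assuming the sortedcontainers package and the row layout from the question; SortedDict, bisect_left, and irange are the library's actual methods):
from sortedcontainers import SortedDict  # pip install sortedcontainers

sd = SortedDict()
for row in compare_list_old:
    sd[tuple(row[0:4])] = row[4]

# Exact lookup, like a plain dict:
print(sd[('James', 'V4', '22.000', '43.000')])       # 1.2000

# Nearest-key search: index of the first key >= the probe.
i = sd.bisect_left(('James', 'V4', '21.000', '40.000'))
print(sd.keys()[i])                                   # ('James', 'V4', '22.000', '43.000')

# Range query: every key matching a name/version prefix.
for key in sd.irange(('James', 'V4'), ('James', 'V4', '~')):
    print(key)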
Once you've done that, at least one of your two outer loops can also be profitably turned into a logarithmic search (which you'll get almost for free by using a tree).
At which point your total time is O(log**2 N * N) instead of O(N**3), which is a huge difference.
If you're not used to dealing with performance in algorithmic-complexity terms, consider this: with only 1000 elements, cubic time takes 1000*1000*1000 = 1 billion steps; log-squared-linear time takes 10*10*1000 = 100 thousand steps. That's the difference between days and seconds.
Upvotes: 3
Reputation: 447
Your current comparison is at least quadratic (because of the nested loop). It is faster to build a dictionary from the old list (linear time), where each key is a tuple of the name and the two coordinates (they appear to be the same in the new and old files), and then, for each item in the new list, check whether that key is in the dictionary (linear time again).
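A minimal sketch of that approach, assuming the row layout from the question (the NOT FOUND text is carried over from the question's output):
# Build the lookup once: name plus the two coordinates as key (linear time).
old = {(row[0], row[2], row[3]): row[4] for row in compare_list_old}

for item in compare_list_new:
    key = (item[0], item[2], item[3])
    if key in old:                         # constant-time membership test
        delta = float(item[4]) - float(old[key])
        print(" ".join(item[0:4]), format(delta, '.4f'))
    else:
        print(" ".join(item), "ITEM NOT FOUND IN OLD FILE BUT IS IN NEW FILE")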
Upvotes: 1