Darksen

Reputation: 11

Making comparisons of two lists faster

I'm trying to compare two files. Each file has around 70k lines, and with my current algorithm it takes around 5 minutes to fully compare all of them.

Essentially, what I'm doing is taking all the lines of both files and putting them into lists, so they look like this:

    compare_list_new = [['Albert', 'V4', '25.000', '45.000', '1.3500'],
                        ['James', 'V4', '22.000', '43.000', '1.4000'],
                        ['James', 'V5', '23.000', '41.000', '1.3000']]

    compare_list_old = [['Albert', 'V4', '25.000', '45.000', '1.3900'],
                        ['James', 'V4', '22.000', '43.000', '1.2000'],
                        ['James', 'V5', '23.000', '41.000', '1.2000']]

The idea is that both files contain mostly the same names, so to find a new entry's counterpart in the old file we must search by coordinates. For example, to match a specific James from the new file to the old one, I have to look up '22.000' and '43.000'.

After finding the entry, I take the 1.4000 from the new file and the 1.2000 from the old file and subtract them to get the delta from old to new.

This is the current algorithm I'm using:

    import time
    import bisect  # imported for a planned binary search; not used below
    from operator import itemgetter


    compare = open("factor.output.new.txt", "w")

    # File comparison algorithm

    start = time.time()  # tracks execution time

    def list_create(fname):  # builds rows of [name, version, x, y, sfactor]
        newlist = []
        with open(fname) as file:
            for _ in range(8):  # skip the header lines at the top of the file
                next(file)
            for line in file:
                lines_list = line.split(" ")
                del lines_list[0]        # drop the columns we don't need
                del lines_list[2:29]
                del lines_list[5:12]
                newlist.append(lines_list)
        return newlist


    # Create the lists and sort them by the x coordinate

    compare_list_new = sorted(list_create("par_iop.pwr.sfactor.output_new.ipf"), key=itemgetter(2))
    compare_list_old = sorted(list_create("par_iop.pwr.sfactor.output_old.ipf"), key=itemgetter(2))


    compare.write("Name Version Coordinate_x Coordinate_y Sfactordelta FLAG\n")
    compare_list_copy = list(compare_list_old)  # an actual copy, so items are not removed from the list being iterated

    match = 0
    for item in compare_list_new:  # compares both lists
        end = time.time()
        print(end - start)
        for line in compare_list_old:
            if item[0:4] == line[0:4]:
                s1 = float(item[4])
                s2 = float(line[4])
                delta = s1 - s2
                item[4] = format(delta, '.4f')
                text = " ".join(item)
                compare.write(text + "  \n")
                compare_list_copy.remove(line)
                match = 1
                break
        if match == 1:
            compare_list_old = compare_list_copy
            match = 0
        else:
            text = " ".join(item)
            compare.write(text + "  ITEM NOT FOUND IN OLD FILE BUT IS IN NEW FILE\n")
            try:
                compare_list_copy.remove(line)  # 'line' here is left over from the last inner loop
            except ValueError:
                pass
            compare_list_old = compare_list_copy
    compare.close()

Essentially, in the part that compares both lists (after sorting them), if an item is a match, the code computes the delta, removes the matched line from the copy, and then sets the old list equal to the copy, so that items are not removed from the list while iterating over it. If the item is not a match, it means it is in the new file but not in the old file.

I want something that could potentially make this process much faster.

Upvotes: 1

Views: 290

Answers (3)

Dinesh

Reputation: 1128

    compare_list_new = [['Albert', 'V4', '25.000', '45.000', '1.3500'],
                        ['James', 'V4', '22.000', '43.000', '1.4000'],
                        ['James', 'V5', '23.000', '41.000', '1.3000']]

    compare_list_old = [['Albert', 'V4', '25.000', '45.000', '1.3900'],
                        ['James', 'V4', '22.000', '43.000', '1.2000'],
                        ['James', 'V5', '23.000', '41.000', '1.2000']]

    d = {}
    for l in compare_list_old:
        # use a tuple of all four matching fields (name, version, x, y) as the key;
        # the value is the old sfactor string
        d[tuple(l[0:4])] = l[4]

    print(d)
    # {('Albert', 'V4', '25.000', '45.000'): '1.3900', ('James', 'V4', '22.000', '43.000'): '1.2000', ('James', 'V5', '23.000', '41.000'): '1.2000'}

    print(d[('Albert', 'V4', '25.000', '45.000')])
    # 1.3900

    for item in compare_list_new:
        old_float_val = d[tuple(item[0:4])]
        new_float_val = item[4]
        # continue whatever calculation here

The idea is to build the old list into a dictionary of keys and values. That way, we are not iterating over the second list once for every item of the first; each lookup is a single hash probe.

Upvotes: 0

abarnert

Reputation: 365597

There's a lot of code here, with no indication of which part you think is slow (or how you know), but one thing immediately jumps out:

    compare_list_copy.remove(line)

… and another remove later.

First, whenever you call lst.remove(val), the list has to do a linear search, comparing each of the elements to val. But you already know the index of the element you want (or, rather, you could know it by just using enumerate), so this whole search is wasted; just del lst[idx] instead.
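
To illustrate, here's a minimal made-up example of the two patterns (remove-by-value vs. delete-by-index):

    lst = ['a', 'b', 'c', 'd']

    # remove() does a second linear search for a value we just found:
    for val in lst:
        if val == 'c':
            lst.remove(val)   # rescans the list to locate 'c' again
            break

    # enumerate() keeps the index, so we can delete without rescanning:
    lst = ['a', 'b', 'c', 'd']
    for idx, val in enumerate(lst):
        if val == 'c':
            del lst[idx]      # uses the index we already have
            break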

Second, whether you remove or del, you're still deleting from the middle of an array. That means shifting all of the subsequent elements up one slot. It's got a much faster constant (it's just a big memmove, rather than a bunch of calls to a compare function), but it's still linear.

And you're doing this inside your inner loop. So, you're multiplying an extra factor of N onto your already quadratic time. And any effort you put into doing the search in logarithmic rather than linear time via bisect is going to be wasted if you just follow that logarithmic search with a linear search over the same data.


If you need something you can search in logarithmic time, and also modify in logarithmic time, what you want is some kind of tree (or tree-list structure, like a skiplist). There are nice libraries wrapping all kinds of binary tree and b-tree variants on PyPI, or you can look up the algorithms on Wikipedia.

Or you can just grab something like the Sorted Containers library that wraps things up at a higher level. For example, a sorteddict acts a lot like a dict, but you can search for the nearest key instead of an exact match, or all keys within a given range, etc. Under the covers, it works with some kind of hybrid rope-of-btrees or something, but you don't need to care about those details; what matters is that it guarantees all of the operations you need in logarithmic time.
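
For instance, here's a minimal sketch with Sorted Containers (the key shape mirrors the question's rows and is an assumption; the library is `sortedcontainers` on PyPI):

    from sortedcontainers import SortedDict

    # Assumed key/value shape: key = (name, version, x, y), value = old sfactor.
    sd = SortedDict({
        ('Albert', 'V4', '25.000', '45.000'): '1.3900',
        ('James', 'V4', '22.000', '43.000'): '1.2000',
        ('James', 'V5', '23.000', '41.000'): '1.2000',
    })

    # Exact lookup works like a plain dict, but the keys stay sorted:
    print(sd[('James', 'V4', '22.000', '43.000')])   # 1.2000

    # Nearest-key search: index of the first key >= the probe
    i = sd.bisect_left(('James', '', '', ''))
    print(sd.keys()[i])   # ('James', 'V4', '22.000', '43.000')

    # All keys in a range, e.g. every 'James' entry:
    for key in sd.irange(('James',), ('James', '\uffff')):
        print(key, sd[key])

    # Deletion is logarithmic too, not a linear shift:
    del sd[('James', 'V4', '22.000', '43.000')]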


Once you've done that, at least one of your two outer loops can also be profitably turned into a logarithmic search (which you'll get almost for free by using a tree).

At which point your total time is O(log**2 N * N) instead of O(N**3), which is a huge difference.

If you're not used to dealing with performance in algorithmic complexity terms, consider this: with only 1000 elements, cubic time takes 1000*1000*1000 = 1 billion steps; log-squared-linear time takes 10*10*1000 = 100 thousand steps. That's the difference between days and seconds.

Upvotes: 3

carefullynamed

Reputation: 447

Your current comparison is at least quadratic because of the nested loop. It is faster to build a dictionary from the first list (linear time), where each key is a tuple of the name and the first two coordinates (they appear to be the same in the new and old file), and then, for each item in the second list, check whether that key is in your dictionary (linear time again). A sketch of that approach is below.
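
Here is one way that could look, given the question's row layout (field positions are assumed from the sample lists; the not-found flag mirrors the original output):

    # Build the lookup in one linear pass over the old rows:
    # key = (name, x, y), value = old sfactor string.
    old_by_key = {(row[0], row[2], row[3]): row[4] for row in compare_list_old}

    with open("factor.output.new.txt", "w") as out:
        out.write("Name Version Coordinate_x Coordinate_y Sfactordelta FLAG\n")
        for row in compare_list_new:                             # second linear pass
            old_val = old_by_key.get((row[0], row[2], row[3]))   # O(1) average lookup
            if old_val is not None:
                delta = float(row[4]) - float(old_val)
                out.write(" ".join(row[:4]) + " " + format(delta, '.4f') + "\n")
            else:
                out.write(" ".join(row) + "  ITEM NOT FOUND IN OLD FILE BUT IS IN NEW FILE\n")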

Upvotes: 1
