user3634601

Reputation: 137

How to improve the efficiency of the script?

I have two files: one has 4K strings (one per row), and the other has 100K strings (one per row).

For each string in the 4K file, I calculate the similarity ratio against every string in the 100K file, and I pick the string from the 100K file with the highest similarity ratio as the "match" for that row of the 4K file.

I tried to do the job using Python dictionaries, since I was told that would be efficient.

But my code is not efficient; see the following:

import itertools
import difflib

# score every (4K x 100K) pair and keep the ones above 0.80
for k, k2 in itertools.product(dict1.keys(), my_dict1.keys()):
    a = difflib.SequenceMatcher(None, k, k2).ratio()
    if a > 0.80:
        my_dict3[k + "\t" + k2] = a


# regroup the scores per 4K string
for key2 in my_dict3.keys():
    k1, k2 = key2.split("\t")
    mydict[k1][k2] = my_dict3[key2]

import operator

# for each 4K string, print its best-matching 100K string
for key4 in mydict.keys():
    key = max(mydict[key4].iteritems(), key=operator.itemgetter(1))[0]
    print "%s\t%s" % (key4, key)

I am wondering why the code is not efficient, since I expected it to be. How can I improve it?

I think I did something wrong, but I am not sure where.

Thank you!

Upvotes: 0

Views: 94

Answers (1)

Anshul Goyal

Reputation: 76887

Though this particular piece of code can be slightly optimized, the time complexity will still remain O(m*n), where m and n are the number of keys in each dict.
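For what it is worth, here is roughly how that loop could be tightened: drop the intermediate dicts and the "\t" key packing and track the best match for each of the 4K strings directly. This is only a minimal sketch reusing the dict1 / my_dict1 names from your code (best is a new name), and it is still O(m*n), so it only saves constant factors:

import difflib

best = {}
for k in dict1:                      # the 4K strings
    best_score, best_match = 0.0, None
    for k2 in my_dict1:              # the 100K strings
        score = difflib.SequenceMatcher(None, k, k2).ratio()
        if score > best_score:
            best_score, best_match = score, k2
    if best_score > 0.80:            # same cutoff as in your code
        best[k] = (best_match, best_score)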

Since dict1 has 4K keys and my_dict1 has 100K keys, the total number of combinations to iterate over is

4K * 100K = 400M

If figuring each combination out took 0.1 s, the time needed to run this program to completion would be

400M * 0.1 s / 86400 ≈ 463 days ≈ 1.3 years

Even if you were able to improve the performance by 20%, it would still take about 1.3 * 0.8 ≈ 1 year.

Even if you used 10 simultaneous threads to do this, you would still need about a month and a half (463 / 10 ≈ 46 days) to run it.

So it is best to find another algorithmic solution to this problem, one that performs better in terms of time complexity.
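One common direction (a sketch of one possibility, not the only one) is to block on character trigrams: index the 100K strings by their 3-grams and only run SequenceMatcher against candidates that share at least one trigram with the query. The names small_list, big_list, trigrams, and index below are hypothetical, and the worst case is still quadratic, but in practice the filter usually prunes most of the 100K comparisons per query:

import difflib
from collections import defaultdict

def trigrams(s):
    # character 3-grams of a string (empty set for strings shorter than 3)
    return {s[i:i + 3] for i in range(len(s) - 2)}

# inverted index: trigram -> indices of the 100K strings containing it
index = defaultdict(set)
for j, s in enumerate(big_list):
    for g in trigrams(s):
        index[g].add(j)

for q in small_list:
    # candidate set: only the strings sharing at least one trigram with q
    candidates = set()
    for g in trigrams(q):
        candidates.update(index.get(g, ()))
    best_score, best_match = 0.0, None
    for j in candidates:
        score = difflib.SequenceMatcher(None, q, big_list[j]).ratio()
        if score > best_score:
            best_score, best_match = score, big_list[j]
    if best_match is not None:
        print("%s\t%s" % (q, best_match))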

Upvotes: 2
