user3634601

Reputation: 137

How to improve the efficiency of the script?

I have two files: one has 4K strings (one per row), and the other has 100K strings (one per row).

For each string in the 4K file, I calculate the similarity ratio against every string in the 100K file, and I pick the string from the 100K file with the highest similarity ratio as the "match" for that row of the 4K file.

I tried to do the job using Python dictionaries, since I was told that would be efficient.

But my code is not efficient; see the following:

import itertools
import difflib

# score every (4K x 100K) pair and keep the ones above 0.80
for k, k2 in itertools.product(dict1.keys(), my_dict1.keys()):
    a = difflib.SequenceMatcher(None, k, k2).ratio()
    if a > 0.80:
        my_dict3[k + "\t" + k2] = a


# regroup the scores per 4K string
for key2 in my_dict3.keys():
    k1, k2 = key2.split("\t")
    mydict[k1][k2] = my_dict3[key2]

import operator

# for each 4K string, print its best-matching 100K string
for key4 in mydict.keys():
    key = max(mydict[key4].iteritems(), key=operator.itemgetter(1))[0]
    print "%s\t%s" % (key4, key)

I am wondering why the code is not efficient, since I expected it to be. How can I improve it?

I think I did something wrong, but I am not sure where.

Thank you!

Upvotes: 0

Views: 94

Answers (1)

Anshul Goyal

Reputation: 76887

Though this particular piece of code can be slightly optimized, the time complexity will still remain O(m*n), where m and n are the number of keys in each dict.
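For what it is worth, here is roughly how that loop could be tightened: drop the intermediate dicts and the "\t" key packing and track the best match for each of the 4K strings directly. This is only a minimal sketch reusing the dict1 / my_dict1 names from your code (best is a new name), and it is still O(m*n), so it only saves constant factors:

import difflib

best = {}
for k in dict1:                      # the 4K strings
    best_score, best_match = 0.0, None
    for k2 in my_dict1:              # the 100K strings
        score = difflib.SequenceMatcher(None, k, k2).ratio()
        if score > best_score:
            best_score, best_match = score, k2
    if best_score > 0.80:            # same cutoff as in your code
        best[k] = (best_match, best_score)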

Since dict1 has 4K keys and my_dict1 has 100K keys, the total number of combinations to iterate over is

4K * 100K = 400M

If figuring each combination out took 0.1 s, the time needed to run this program to completion would be

400M * 0.1 s / 86400 ≈ 463 days ≈ 1.3 years

Even if you were able to improve the performance by 20%, it would still take about 1.3 * 0.8 ≈ 1 year.

Even if you used 10 simultaneous threads to do this, you would still need about a month and a half (463 / 10 ≈ 46 days) to run it.

So it is best to find another algorithmic solution to this problem, one that performs better in terms of time complexity.
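One common direction (a sketch of one possibility, not the only one) is to block on character trigrams: index the 100K strings by their 3-grams and only run SequenceMatcher against candidates that share at least one trigram with the query. The names small_list, big_list, trigrams, and index below are hypothetical, and the worst case is still quadratic, but in practice the filter usually prunes most of the 100K comparisons per query:

import difflib
from collections import defaultdict

def trigrams(s):
    # character 3-grams of a string (empty set for strings shorter than 3)
    return {s[i:i + 3] for i in range(len(s) - 2)}

# inverted index: trigram -> indices of the 100K strings containing it
index = defaultdict(set)
for j, s in enumerate(big_list):
    for g in trigrams(s):
        index[g].add(j)

for q in small_list:
    # candidate set: only the strings sharing at least one trigram with q
    candidates = set()
    for g in trigrams(q):
        candidates.update(index.get(g, ()))
    best_score, best_match = 0.0, None
    for j in candidates:
        score = difflib.SequenceMatcher(None, q, big_list[j]).ratio()
        if score > best_score:
            best_score, best_match = score, big_list[j]
    if best_match is not None:
        print("%s\t%s" % (q, best_match))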

Upvotes: 2
