Reputation: 137
I have two files: one has 4K strings (one per row), the other has 100K strings (one per row).
For each string in the 4K file, I calculate the similarity ratio between it and every string in the 100K file, and I pick the 100K-file string with the highest ratio as the "match" for that 4K-file row.
I tried to do this with Python dictionaries, since I was told they are efficient, but my code is slow. Here it is:
    import itertools
    import difflib
    import operator

    # Score every (4K string, 100K string) pair; keep pairs with ratio > 0.80.
    for k, k2 in itertools.product(dict1.keys(), my_dict1.keys()):
        a = difflib.SequenceMatcher(None, k, k2).ratio()
        if a > 0.80:
            my_dict3[k + "\t" + k2] = a

    # Re-split the combined keys into a nested dict: mydict[k1][k2] = ratio.
    for key2 in my_dict3.keys():
        k1, k2 = key2.split("\t")
        mydict.setdefault(k1, {})[k2] = my_dict3[key2]

    # For each 4K string, print the 100K string with the highest ratio.
    for key4 in mydict.keys():
        key = max(mydict[key4].iteritems(), key=operator.itemgetter(1))[0]
        print "%s\t%s" % (key4, key)
I am wondering why this code is so slow; I expected it to be efficient. How can I improve it? I think I did something wrong, but I am not sure where.
Thank you!
Upvotes: 0
Views: 94
Reputation: 76887
Though this particular piece of code can be slightly optimized, the time complexity will still remain O(m*n), where m and n are the number of keys in the two dicts.
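Even the compact stdlib form of the same matching is still a full pairwise scan. A sketch (assuming, as in your code, that the 4K strings are the keys of dict1 and the 100K strings are the keys of my_dict1):

    import difflib

    # For each of the 4K strings, keep the single best match scoring at or
    # above 0.80. get_close_matches still walks all 100K candidates, so this
    # is O(m*n) as well.
    candidates = list(my_dict1.keys())      # the 100K strings
    for k in dict1.keys():                  # the 4K strings
        best = difflib.get_close_matches(k, candidates, n=1, cutoff=0.80)
        if best:
            print "%s\t%s" % (k, best[0])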
Since dict1 has 4K keys and my_dict1 has 100K keys, the total number of combinations to iterate over is 4K * 100K = 400M.
If evaluating each combination takes about 0.1 s, the time needed to run the whole program is roughly 400M * 0.1 / 86400 ≈ 460 days ≈ 1.3 years.
Even if you manage to improve performance by 20%, it would still take about 1.3 * 0.8 ≈ 1 year.
Even if you used 10 simultaneous threads, you would still need around a month and a half to run this.
So, it is best to figure out another algorithmic solution to this problem of yours, which performs better in terms of time complexity.
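One possible direction (a sketch, not a drop-in replacement for your code; the list names strings_4k and strings_100k are just placeholders for the two input files): index the 100K strings by character trigrams and only run SequenceMatcher against candidates that share at least one trigram with the query string. For reasonably distinctive strings this shrinks the candidate set a lot, although the worst case is still quadratic.

    import difflib
    from collections import defaultdict

    def trigrams(s):
        # Set of 3-character substrings; fall back to the whole string if shorter.
        s = s.lower()
        return set(s[i:i + 3] for i in range(len(s) - 2)) or set([s])

    # strings_100k / strings_4k: plain lists of the strings from the two files.
    # Inverted index: trigram -> indices of the 100K strings containing it.
    index = defaultdict(set)
    for i, s in enumerate(strings_100k):
        for g in trigrams(s):
            index[g].add(i)

    for q in strings_4k:
        # Only strings sharing at least one trigram with q get scored.
        candidate_ids = set()
        for g in trigrams(q):
            candidate_ids |= index.get(g, set())
        best, best_ratio = None, 0.80           # keep the > 0.80 threshold
        for i in candidate_ids:
            r = difflib.SequenceMatcher(None, q, strings_100k[i]).ratio()
            if r > best_ratio:
                best, best_ratio = strings_100k[i], r
        if best is not None:
            print "%s\t%s" % (q, best)

How much this helps depends entirely on the data: if most strings share common trigrams, the candidate sets stay large and you would need a stronger blocking key or a dedicated fuzzy-matching approach.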
Upvotes: 2