Reputation: 902
I have to compare two lists of dicts such as below:
main = [{'id': 1, 'rate': 13, 'type': 'C'}, {'id': 2, 'rate': 39, 'type': 'A'}, ...]
compare = [{'id': 119, 'rate': 33, 'type': 'D'}, {'id': 120, 'rate': 94, 'type': 'A'}, ...]
for m in main:
    for c in compare:
        if (m['rate'] > c['rate']) and (m['type'] == c['type']):
            # ...
Each list has around 9,000 items, so the inner comparison runs around 81,000,000 times (9,000 * 9,000). How can I speed this up?
Upvotes: 1
Views: 115
Reputation: 59633
You could first sort or split the lists by type and perform the comparisons per type only. The question then is how many operations you need for the sorting or splitting, and how many you save on the comparisons. Remember that there are quite efficient sorting algorithms.
The next optimization could be sorting by rate. That way you can break out of the inner loop as soon as the condition m['rate'] > c['rate'] is no longer satisfied. In fact, you can even use a divide and conquer algorithm.
Last but not least, you might benefit from Why is processing a sorted array faster than processing an unsorted array?, which is not an algorithmic improvement, but can still make a huge difference.
Let me generate a dataset with 9000 items (in the future, you may want to include such a thing in your question, since it makes our life easier):
import random

types = ["A", "B", "C", "D", "E", "F"]

main = []
compare = []
for i in range(9000):
    main.append({'id': random.randint(0, 20000), 'rate': random.random() * 500, 'type': types[random.randint(0, 5)]})
    compare.append({'id': random.randint(0, 20000), 'rate': random.random() * 500, 'type': types[random.randint(0, 5)]})
Running this with a loop like
import time

start = time.time()
cycles = 0
for m in main:
    for c in compare:
        cycles += 1
        if (m['rate'] > c['rate']) and (m['type'] == c['type']):
            pass
end = time.time()
print("Total number of cycles: " + str(cycles))
print("Seconds taken: " + str(end - start))
it results (on my machine) in 81M cycles and ~30 seconds.
Splitting by type might look like this:
# Split both lists into per-type buckets
mainsplit = {}
compsplit = {}
for t in types:
    cycles += 1
    mainsplit[t] = []
    compsplit[t] = []
for m in main:
    cycles += 1
    mainsplit[m["type"]].append(m)
for c in compare:
    cycles += 1
    compsplit[c["type"]].append(c)

# Then go through it by type
for t in types:
    for m in mainsplit[t]:
        for c in compsplit[t]:
            cycles += 1
            if m['rate'] > c['rate']:
                pass
This gives ~14M cycles and only ~4 s.
Sorting the partial results by "rate" and finding a lower limit for "rate":
# Then go through it by type, with both buckets sorted by rate
for t in types:
    mainsplit[t].sort(key=lambda i: i["rate"])
    compsplit[t].sort(key=lambda i: i["rate"])
    start_of_m_in_c = 0
    for m in mainsplit[t]:
        for nc in range(start_of_m_in_c, len(compsplit[t])):
            cycles += 1
            if m["rate"] > compsplit[t][nc]["rate"]:
                pass  # match: m['rate'] > the rate at nc
            else:
                # Rates are sorted ascending, so no later entry can match
                # this m either; remember where to resume for the next
                # (larger) m and break out of the inner loop.
                start_of_m_in_c = nc
                break
Cycles is now 36000 (not counting the cycles used by the sort algorithm) and the time drops to ~30 ms.
All in all, that's a performance increase of a factor of 1000.
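The divide and conquer variant mentioned above could look roughly like this (a minimal sketch, assuming the sorted mainsplit/compsplit structures from the previous snippet; bisect_left is the standard-library binary search):

from bisect import bisect_left

for t in types:
    # compsplit[t] is already sorted ascending by rate
    comp_rates = [c["rate"] for c in compsplit[t]]
    for m in mainsplit[t]:
        # Index of the first compare entry with rate >= m["rate"];
        # every entry before it satisfies m["rate"] > c["rate"].
        cutoff = bisect_left(comp_rates, m["rate"])
        for c in compsplit[t][:cutoff]:
            pass  # matching pair (m, c)

Each lookup is then O(log n) instead of a linear scan.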
Upvotes: 3
Reputation: 1796
This looks like a job for SQLite - it's the kind of thing databases are totally optimized for. Python has very nice bindings to SQLite, so it should fit nicely.
Here's a starting point...
import sqlite3

c = None
try:
    c = sqlite3.connect(':memory:')
    c.execute('create table main ( id integer primary key, rate integer not null, type text not null );')
    main = [{'id': 1, 'rate': 13, 'type': 'C'}, {'id': 2, 'rate': 39, 'type': 'A'}]
    for e in main:
        # String concatenation for brevity; parameterized queries are the
        # safer, idiomatic choice.
        c.execute('insert into main (id, rate, type) VALUES (' + str(e['id']) + ', ' +
                  str(e['rate']) + ', "' + e['type'] + '")')
    # now for the query
    # exercise left for the OP (but does require some SQL expertise)
except sqlite3.Error as e:
    print(e)
finally:
    if c:
        c.close()
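For illustration, one possible shape of that query (a sketch, assuming a second compare table created and filled the same way as main):

# Sketch: assumes a "compare" table populated like "main" above.
rows = c.execute(
    'select m.id, c.id from main m join compare c '
    'on m.type = c.type and m.rate > c.rate'
).fetchall()
for main_id, comp_id in rows:
    print(main_id, comp_id)

An index on compare(type, rate) should let SQLite avoid scanning the full cross product.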
Upvotes: 0
Reputation: 107095
Given:
main = [
    {'id': 1, 'rate': 13, 'type': 'C'},
    {'id': 2, 'rate': 39, 'type': 'A'},
    {'id': 3, 'rate': 94, 'type': 'A'},
    {'id': 4, 'rate': 95, 'type': 'A'},
    {'id': 5, 'rate': 96, 'type': 'A'}
]
compare = [
    {'id': 119, 'rate': 33, 'type': 'D'},
    {'id': 120, 'rate': 94, 'type': 'A'}
]
You can first map the two lists of dicts into two dicts of lists of dicts indexed by type, and sort the sub-lists by rate:
mappings = []
for lst in main, compare:
    mappings.append({})
    for entry in lst:
        mappings[-1].setdefault(entry['type'], []).append(entry)
    for entries in mappings[-1].values():
        entries.sort(key=lambda entry: entry['rate'])
main, compare = mappings
so that main becomes:
{'C': [{'id': 1, 'rate': 13, 'type': 'C'}],
 'A': [{'id': 2, 'rate': 39, 'type': 'A'},
       {'id': 3, 'rate': 94, 'type': 'A'},
       {'id': 4, 'rate': 95, 'type': 'A'},
       {'id': 5, 'rate': 96, 'type': 'A'}]}
while compare becomes:
{'D': [{'id': 119, 'rate': 33, 'type': 'D'}],
 'A': [{'id': 120, 'rate': 94, 'type': 'A'}]}
You can then iterate through the matching types of the two dicts in linear time, and use bisect to find the index in each sub-list of main where the rate becomes greater than that of the compare entry, which takes O(log n) time, and then iterate through the rest of the sub-list from that index for processing. Overall this algorithm has O(n log n) time complexity, an improvement over the O(n^2) time complexity of your original code:
from bisect import bisect

for type in main.keys() & compare.keys():
    main_entries = main[type]
    # Build the sorted rate list once per type so each lookup is O(log n)
    rates = [d['rate'] for d in main_entries]
    for entry in compare[type]:
        for match in main_entries[bisect(rates, entry['rate']):]:
            print(match['id'], entry['id'])
This outputs:
4 120
5 120
Demo: https://repl.it/repls/EasygoingReadyTechnologies
Disclaimer: This may look like an implementation of @ThomasWeller's solution, but I actually did not see his answer until I finished my coding, which was interrupted by my other work. Also, @ThomasWeller wants to sort the two lists by type, which would incur an O(n log n) time complexity, when it can be done in linear time, as shown in the for entry in lst loop in my code.
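For reference, that same linear-time grouping step could also be written with collections.defaultdict (a minimal equivalent sketch):

from collections import defaultdict

# Group a list of dicts by their 'type' key in a single linear pass
grouped = defaultdict(list)
for entry in lst:
    grouped[entry['type']].append(entry)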
Upvotes: 1
Reputation: 1221
You can use the PyPy interpreter instead of classic CPython. It can give you about an 80% speedup.
Upvotes: 0