Reputation: 145
I have a list of dictionaries, like this:
[{'user': '123456', 'db': 'db1', 'size': '8628'},
 {'user': '123456', 'db': 'db1', 'size': '7168'},
 {'user': '123456', 'db': 'db1', 'size': '38160'},
 {'user': '222345', 'db': 'db3', 'size': '8628'},
 {'user': '222345', 'db': 'db3', 'size': '8628'},
 {'user': '222345', 'db': 'db5', 'size': '840'},
 {'user': '34521', 'db': 'db6', 'size': '12288'},
 {'user': '34521', 'db': 'db6', 'size': '476'},
 {'user': '2345156', 'db': 'db7', 'size': '5120'}, .....]
This list contains millions of entries. Each user can be found in multiple dbs, and each user can have multiple entries in the same db. I want to sum up the size occupied by each user, per db. I don't want to use pandas. At the moment I do it this way:
result = []
for user in unique_users:
    for db in unique_dbs:
        total_size = 0
        for i in big_list:
            if i['user'] == user and i['db'] == db:
                total_size += float(i['size'])
        if total_size > 0:
            row = {}
            row['user'] = user
            row['db'] = db
            row['size'] = total_size
            result.append(row)
The problem is that this triple for loop amounts to a huge amount of work (hundreds of billions of iterations), which takes forever to produce the result. If big_list is small, this works very well.
How should I approach this in order to keep it fast and simple? Thanks a lot!
Upvotes: 3
Views: 6046
Reputation: 795
In summary, looking at the three suggested answers to this problem, the approach from rdas is probably the winner. With a couple of modifications, it is 57% faster than the solution from Jérôme, and 180 times faster than the original code. Сергей's solution was roughly 350 times slower on a much-trimmed subset of the data (1000 entries); it also seemed to scale very badly and I did not have time to wait for results from the full dataset.
The timings are as follows:
| Method                | Time               | Relative           |
|-----------------------|--------------------|--------------------|
| Original              | 102.83720700000413 | 180.4580695623077  |
| Jérôme                | 0.8722491000080481 | 1.5306171118096032 |
| Jérôme (updated)      | 0.9000219000154175 | 1.5793526426731528 |
| rdas                  | 0.6866130999987945 | 1.2048642527015463 |
| rdas (updated)        | 0.6580462999991141 | 1.1547354157572032 |
| rdas with defaultdict | 0.5698675999883562 | 1.0                |
In Jérôme's answer, the final results calculation can be improved by swapping the final list construction to a list comprehension. Looking at the isolated list comprehension change, the updated code is roughly 38% faster. It is worth noting this isn't a major contributor to the overall time.
[{'user': user, 'db': db, 'size': total_size} for user in unique_users for db in unique_dbs if (total_size := aggregate_dict.get((user, db)))]
Comparing before (Method1) and after (Method2) over a few iterations gives:
Method1: 6.447344800049905
Method2: 4.851344000024255
Method1: 6.546811900043394
Method2: 4.816081999975722
Method1: 6.863985500007402
Method2: 4.755402299982961
Method1: 6.8024070000392385
Method2: 4.738170499971602
Method1: 6.524162200046703
Method2: 4.755566900013946
The first update to rdas' answer was to avoid doing multiple dictionary lookups by caching the result of the first lookup:
if not (user_dict := user_to_db_to_size.get(user)):
    user_to_db_to_size[user] = {db: size}
elif db not in user_dict:
    user_dict[db] = size
else:
    user_dict[db] += size
This was subsequently improved again by swapping to collections.defaultdict, which removes the remaining costly dictionary lookups:
user_to_db_to_size = defaultdict(lambda: defaultdict(int))
for entry in big_list:
    user = entry['user']
    db = entry['db']
    size = int(entry['size'])
    user_to_db_to_size[user][db] += size
return user_to_db_to_size
This is the code used to run the benchmarks:
import random
import timeit
from collections import defaultdict

random.seed(1)

test_iterations = 100
big_list = [{'user': random.randint(0, 100), 'db': f'db{random.randint(1, 10)}', 'size': f'{random.randint(100, 90000)}'} for i in range(10000)]
unique_users = {i['user'] for i in big_list}
unique_dbs = {i['db'] for i in big_list}

def method0():
    result = []
    for user in unique_users:
        for db in unique_dbs:
            total_size = 0
            for i in big_list:
                if i['user'] == user and i['db'] == db:
                    total_size += float(i['size'])
            if total_size > 0:
                row = {}
                row['user'] = user
                row['db'] = db
                row['size'] = total_size
                result.append(row)
    return result

def method1():
    aggregate_dict = dict()
    for i in big_list:
        key = (i['user'], i['db'])
        value = float(i['size'])
        if key in aggregate_dict:
            aggregate_dict[key] += value
        else:
            aggregate_dict[key] = value
    result = []
    for user in unique_users:
        for db in unique_dbs:
            total_size = aggregate_dict.get((user, db))
            if total_size is not None and total_size > 0:
                result.append({'user': user, 'db': db, 'size': total_size})
    return result

def method2():
    aggregate_dict = dict()
    for i in big_list:
        key = (i['user'], i['db'])
        value = float(i['size'])
        if key in aggregate_dict:
            aggregate_dict[key] += value
        else:
            aggregate_dict[key] = value
    return [{'user': user, 'db': db, 'size': total_size} for user in unique_users for db in unique_dbs if (total_size := aggregate_dict.get((user, db)))]

def method3():
    user_to_db_to_size = {}
    for entry in big_list:
        user = entry['user']
        db = entry['db']
        size = int(entry['size'])
        if user not in user_to_db_to_size:
            user_to_db_to_size[user] = {}
        if db not in user_to_db_to_size[user]:
            user_to_db_to_size[user][db] = 0
        user_to_db_to_size[user][db] += size
    return user_to_db_to_size

def method4():
    user_to_db_to_size = {}
    for entry in big_list:
        user = entry['user']
        db = entry['db']
        size = int(entry['size'])
        if not (user_dict := user_to_db_to_size.get(user)):
            user_to_db_to_size[user] = {db: size}
        elif db not in user_dict:
            user_dict[db] = size
        else:
            user_dict[db] += size
    return user_to_db_to_size

def method5():
    user_to_db_to_size = defaultdict(lambda: defaultdict(int))
    for entry in big_list:
        user = entry['user']
        db = entry['db']
        size = int(entry['size'])
        user_to_db_to_size[user][db] += size
    return user_to_db_to_size

assert (m0 := method0()) == method1()
print('Method1 OK')
assert m0 == method2()
print('Method2 OK')
assert all(v['size'] == method3()[v['user']][v['db']] for v in m0)
print('Method3 OK')
assert all(v['size'] == method4()[v['user']][v['db']] for v in m0)
print('Method4 OK')
assert all(v['size'] == method5()[v['user']][v['db']] for v in m0)
print('Method5 OK')

t0 = timeit.timeit(method0, number=test_iterations)
t1 = timeit.timeit(method1, number=test_iterations)
t2 = timeit.timeit(method2, number=test_iterations)
t3 = timeit.timeit(method3, number=test_iterations)
t4 = timeit.timeit(method4, number=test_iterations)
t5 = timeit.timeit(method5, number=test_iterations)
tmin = min((t0, t1, t2, t3, t4, t5))
print(f'| Method           | Time                 | Relative            |')
print(f'|------------------|----------------------|---------------------|')
print(f'| Original         | {t0} | {t0 / tmin} |')
print(f'| Jérôme           | {t1} | {t1 / tmin} |')
print(f'| Jérôme (updated) | {t2} | {t2 / tmin} |')
print(f'| rdas             | {t3} | {t3 / tmin} |')
print(f'| rdas (updated)   | {t4} | {t4 / tmin} |')
print(f'| defaultdict      | {t5} | {t5 / tmin} |')
results in:
Method1 OK
Method2 OK
Method3 OK
Method4 OK
Method5 OK
| Method           | Time                 | Relative            |
|------------------|----------------------|---------------------|
| Original         | 102.83720700000413 | 180.4580695623077 |
| Jérôme           | 0.8722491000080481 | 1.5306171118096032 |
| Jérôme (updated) | 0.9000219000154175 | 1.5793526426731528 |
| rdas             | 0.6866130999987945 | 1.2048642527015463 |
| rdas (updated)   | 0.6580462999991141 | 1.1547354157572032 |
| defaultdict      | 0.5698675999883562 | 1.0 |
Note that Сергей's method is roughly equivalent to:
def c():
    tmp = Counter()
    for d in big_list:
        tmp = tmp + Counter({(d['user'], d['db']): int(d['size'])})
    return tmp
so it is effectively creating a new dictionary on every iteration of the loop, and the dictionary it creates gets bigger with each iteration. I suspect this scales as N ** 2 or worse.
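To sanity-check that suspicion, here is a small, hedged timing sketch (the keys and sizes are made up purely for illustration): it times the repeated-addition pattern on inputs with mostly unique keys, so the accumulated Counter keeps growing.
import timeit
from collections import Counter

# Rebuild-per-step pattern: each `tmp + Counter(...)` copies the whole
# accumulated Counter, so the cost of one step grows with its current size.
def repeated_add(items):
    tmp = Counter()
    for key, size in items:
        tmp = tmp + Counter({key: size})
    return tmp

for n in (1000, 2000, 4000):
    items = [((f'user{i}', f'db{i % 7}'), i) for i in range(n)]
    print(n, timeit.timeit(lambda: repeated_add(items), number=1))
If the time roughly quadruples when n doubles, that is consistent with N ** 2 behaviour.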
Upvotes: 3
Reputation: 1723
If you use Counter and make tuples of the value pairs (user, db) the keys, then:
from collections import Counter
data = [{'user': '123456', 'db': 'db1', 'size': '8628'},
        {'user': '123456', 'db': 'db1', 'size': '7168'},
        {'user': '123456', 'db': 'db1', 'size': '38160'},
        {'user': '222345', 'db': 'db3', 'size': '8628'},
        {'user': '222345', 'db': 'db3', 'size': '8628'},
        {'user': '222345', 'db': 'db5', 'size': '840'},
        {'user': '34521', 'db': 'db6', 'size': '12288'},
        {'user': '34521', 'db': 'db6', 'size': '476'},
        {'user': '2345156', 'db': 'db7', 'size': '5120'}]
print(sum((Counter({(d['user'], d['db']): int(d['size'])}) for d in data), start=Counter()))
Counter({('123456', 'db1'): 53956, ('222345', 'db3'): 17256, ('34521', 'db6'): 12764, ('2345156', 'db7'): 5120, ('222345', 'db5'): 840})
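On a very large list, the repeated Counter + Counter additions above can get expensive, since each addition builds a new Counter. As a hedged alternative sketch, reusing the same data list, a single Counter can be updated in place; Counter returns 0 for missing keys, so no existence check is needed:
from collections import Counter

totals = Counter()
for d in data:
    totals[(d['user'], d['db'])] += int(d['size'])
print(totals)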
Upvotes: 1
Reputation: 50678
There are two main issues with the current approach: an inefficient algorithm and an inefficient data structure.
The first is that the algorithm is clearly inefficient, as it iterates over the big list many times. There is no need to iterate over the whole list to filter on a single user and db. You can iterate over the big list once and aggregate the data using a dictionary, whose key is simply a (user, db) tuple and whose value is total_size. Here is an untested example:
# Aggregation part
# Note: a defaultdict can be used instead to make the code possibly simpler
aggregate_dict = dict()
for i in big_list:
    key = (i['user'], i['db'])
    value = float(i['size'])
    if key in aggregate_dict:
        aggregate_dict[key] += value
    else:
        aggregate_dict[key] = value

# Fast creation of `result`
result = []
for user in unique_users:
    for db in unique_dbs:
        total_size = aggregate_dict.get((user, db))
        if total_size is not None and total_size > 0:
            result.append({'user': user, 'db': db, 'size': total_size})
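As the comment at the top of the aggregation loop suggests, a defaultdict can make that part slightly simpler; a minimal sketch of that variant (same big_list as in the question):
from collections import defaultdict

# defaultdict(float) supplies 0.0 for missing keys, so the if/else disappears.
aggregate_dict = defaultdict(float)
for i in big_list:
    aggregate_dict[(i['user'], i['db'])] += float(i['size'])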
The other issue is the inefficient data structure: for each row, the keys are replicated, while tuples could be used instead. In fact, a better data structure is to store a dictionary of (column, items) key-value pairs, where items is a list of values for the target column. This way of storing data is called a dataframe. This is roughly what Pandas uses internally (except that it uses a Numpy array, which is even better as it is more compact and generally more efficient than a list for most operations). Using this data structure for both the input and the output should result in a significant speed-up (if combined with Numpy) and a lower memory footprint.
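As a rough illustration of that column-oriented idea, here is a hedged sketch with one possible NumPy-based group-sum; the combined string key is just an illustrative trick, not necessarily the exact layout the answer has in mind:
import numpy as np

# Tiny stand-in for big_list; in practice this would be the full dataset.
big_list = [
    {'user': '123456', 'db': 'db1', 'size': '8628'},
    {'user': '123456', 'db': 'db1', 'size': '7168'},
    {'user': '222345', 'db': 'db3', 'size': '8628'},
]

# Column-oriented storage: one array per column instead of one dict per row.
users = np.array([i['user'] for i in big_list])
dbs = np.array([i['db'] for i in big_list])
sizes = np.array([float(i['size']) for i in big_list])

# Group by (user, db): build a combined key, then sum sizes per unique key.
keys = np.char.add(np.char.add(users, '|'), dbs)
unique_keys, inverse = np.unique(keys, return_inverse=True)
totals = np.bincount(inverse, weights=sizes)
for key, total in zip(unique_keys, totals):
    user, db = key.split('|')
    print(user, db, total)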
Upvotes: 3
Reputation: 21285
Try mapping user to db to total size in a dictionary. It will require additional memory, but it should be faster to access and requires only a single pass through the data:
user_to_db_to_size = {}
for entry in big_list:
    user = entry['user']
    db = entry['db']
    size = int(entry['size'])
    if user not in user_to_db_to_size:
        user_to_db_to_size[user] = {}
    if db not in user_to_db_to_size[user]:
        user_to_db_to_size[user][db] = 0
    user_to_db_to_size[user][db] += size
print(user_to_db_to_size)
For your sample data it produces:
{'123456': {'db1': 53956}, '222345': {'db3': 17256, 'db5': 840}, '34521': {'db6': 12764}, '2345156': {'db7': 5120}}
And now you can access the total size per user/db with:
print(user_to_db_to_size['123456']['db1']) # 53956
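If the flat list-of-dicts shape from the question is needed afterwards, the nested mapping can be flattened with a comprehension; a small sketch reusing user_to_db_to_size from above:
result = [{'user': user, 'db': db, 'size': size}
          for user, db_sizes in user_to_db_to_size.items()
          for db, size in db_sizes.items()]
print(result)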
Upvotes: 3