Reputation: 99
I have an array of dicts, and I need every combination of them in which no id value repeats and the ratio values sum to 1.0.
So the results would be:
results = [
    [
        {
            'id': 1,
            'ratio': .01
        },
        {
            'id': 2,
            'ratio': .99
        },
    ],
    [
        {
            'id': 1,
            'ratio': .50
        },
        {
            'id': 2,
            'ratio': .50
        },
    ],
    [ ... ],
    [ ... ],
]
For example:
_array_list = [
    {
        'id': 1,
        'ratio': .01
    },
    {
        'id': 1,
        'ratio': .02
    },
    ...
    {
        'id': 2,
        'ratio': .01
    },
    {
        'id': 3,
        'ratio': .02
    },
    ...
]
Each id has ratios from .01 to 1.0 in steps of .01.
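To make that concrete, a list of that shape (purely illustrative, assuming consecutive integer ids) can be generated like this:
_array_list = [
    {'id': ID, 'ratio': ratio / 100}
    for ID in (1, 2)
    for ratio in range(1, 101)
]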
I then do the following to get each possible combination (there is a reason for this, but I am leaving out the stuff that has nothing to do with the issue):
from itertools import combinations

unique_list_count = 2  # (this is the number of unique ids)
iter_count = 0
_iter_count = 0
all_combos = []
for i in range(1, unique_list_count + 1):
    for combo in combinations(_array_list, i):
        _iter_count += 1
        ids = []
        # if iter_count > 1:
        #     break
        for c in combo:
            ids.append(c['id'])
        is_id_duplicate = len(ids) != len(set(ids))
        if is_id_duplicate is False:
            # make sure only appending full values
            if sum(v['ratio'] for v in combo) == 1.0:
                iter_count += 1
                print(iter_count, _iter_count)
                all_combos.append(list(combo))
I'm not sure if this is a good way or if I can make it better, but it works. The issue is that when I have 5 ids, each with 100 dictionaries, it does about 600,000,000 combinations and takes about 20 minutes.
Is there a way to do this in a more efficient and faster way?
Upvotes: 0
Views: 909
Reputation: 9796
I assume the general case, where not every id has all the values in [0.01, 1.0].
There are 3 main optimisations you can make, and they all aim to instantly drop branches that are guaranteed not to satisfy your conditions.
So, first, instead of having everything in a flat list of dicts, reorganise the data by id in the following form. This way you instantly avoid pointless combinations, e.g., [{'id': 1, 'ratio': 0.01}, {'id': 1, 'ratio': 0.02}], and it also makes it easier to try combinations between ids:
# if your ids are 0-based consecutive integer numbers, a list of lists works too
array_list = {
1: [0.01, 0.02, ...],
2: [0.01, 0.02, ...],
3: [...],
}
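A minimal sketch of that reorganisation over the question's _array_list (the full code further down does the same thing, just with sets of integers):
from collections import defaultdict

array_list = defaultdict(list)
for d in _array_list:
    array_list[d['id']].append(d['ratio'])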
Second, if you're searching for a triplet and you already have (0.54, 0.33, _), you don't have to search all possible values for the last id. There is only one that can satisfy the condition sum(ratios) == 1.0.
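As a tiny illustration of that idea on the integer scale introduced below (last_id_ratios is a placeholder name, not something from the code further down):
partial_sum = 54 + 33            # the (0.54, 0.33, _) prefix as integer percent
required = 100 - partial_sum     # 13: the only value that can complete the triplet
valid = required in last_id_ratios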
Third, say you have 3 ids and they all have all the values in [0.01, 0.44]. It is pointless to try any combinations for (0.99, _, _), because the minimum sum of the last two ids is 0.02; therefore the maximum value the first id needs to explore is 0.98 (well, 0.44 in this example, but you get my drift). Similarly, if the maximum sum of the last two ids is 0.88, there is no reason to explore values below 0.12 for the first id. A special case of this is when the sum of the minimum values of all ids is more than 1.0 (or the sum of the maximums is less than 1.0), in which case you can instantly drop that combination of ids and move on.
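Sketched in my own placeholder notation (again on the integer percent scale), the useful search range for the current id becomes:
left = 100 - partial_sum                                       # what the remaining ids must still add up to
start = max(min(current_values), left - max_sum_of_remaining)
end   = min(max(current_values), left - min_sum_of_remaining)
# only values in range(start, end + 1) can possibly lead to a total of exactly 100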
Finally, you are blessed in dealing only with discrete values, so you're better off converting everything to integers. The first reason is to avoid any headaches with floating-point arithmetic. Case in point: did you know that your code misses some combinations exactly due to these inaccuracies? And since you will be generating your own value ranges due to optimisation #3, it's much simpler to do for i in range(12, 99) than some roundabout way to generate all values in [0.12, 0.99) while making sure everything is properly rounded off at the second decimal digit AND THEN properly added together and checked against some tolerance value close to 1.0.
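A quick demonstration of that kind of inaccuracy, plus the integer round trip (the 0.1 example is standard floating-point behaviour, not taken from the question's data):
print(sum([0.1] * 10) == 1.0)    # False: the float sum is 0.9999999999999999
print(int(round(0.07 * 100)))    # 7: safe conversion to the integer percent scale
print(7 / 100)                   # 0.07: convert back only when building the final output
Putting all of the above together, the full code: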
from collections import defaultdict
import itertools as it
def combined_sum(values):
def _combined_sum(values, comb, partial_sum, depth, mins, maxs):
if depth == len(values) - 1:
required = 100 - partial_sum
if required in values[-1]:
yield comb + (required,)
else:
min_value = mins[depth+1]
max_value = maxs[depth+1]
start_value = max(min(values[depth]), 100 - partial_sum - max_value)
end_value = max(1, 100 - partial_sum - min_value)
for value in range(start_value, end_value+1):
if value in values[depth]:
yield from _combined_sum(values, comb+(value,), partial_sum+value, depth+1, mins, maxs)
# precompute all the partial min/max sums, because we will be using them a lot
mins = [sum(min(v) for v in values[i:]) for i in range(len(values))]
maxs = [sum(max(v) for v in values[i:]) for i in range(len(values))]
if mins[0] > 100 or maxs[0] < 100:
return []
return _combined_sum(values, tuple(), 0, 0, mins, maxs)
def asset_allocation(array_list, max_size):
# a set is preferred here because we will be checking a lot whether
# a value is in the iterable, which is faster a set than in a tuple/list
collection = defaultdict(set)
for d in array_list:
collection[d['id']].add(int(round(d['ratio'] * 100)))
all_combos = []
for i in range(1, max_size+1):
for ids in it.combinations(collection.keys(), i):
values = [collection[ID] for ID in ids]
for group in combined_sum(values):
all_combos.append([{'id': ID, 'ratio': ratio/100} for ID, ratio in zip(ids, group)])
return all_combos
array_list = [{'id': ID, 'ratio': ratio/100}
for ID in (1, 2, 3, 4, 5)
for ratio in range(1, 101)
]
max_size = 5
result = asset_allocation(array_list, max_size)
This finishes in 14-15 seconds on my machine.
For comparison, for 3 ids this finishes in 0.007 seconds, and Gabor's solution, which effectively implements only optimisation #1, finishes in 0.18 seconds. For 4 ids it's 0.43 s and 18.45 s respectively. For 5 ids I stopped timing his solution after a few minutes, but it was expected to take at least 10 minutes.
If you are dealing with the case where all ids have all the values in [0.01, 1.0] and you insist on having the specific output indicated in your question, the above approach is still optimal. However, if you are okay with generating the output in a different format, you can do better.
For a specific group size, e.g., singles, pairs, triplets, etc, generate all the partitions that add up to 100 using the stars and bars approach. That way, instead of generating (1, 99), (2, 98), etc, for each pair of ids, i.e., (1, 2), (1, 3) and (2, 3), you do this only once.
I've modified the code from here to not allow for 0 in any partition.
import itertools as it
def partitions(n, k):
for c in it.combinations(range(1, n), k-1):
yield tuple(b-a for a, b in zip((0,)+c, c+(n,)))
def asset_allocation(ids, max_size):
all_combos = []
for k in range(1, max_size+1):
id_comb = tuple(it.combinations(ids, k))
p = tuple(partitions(100, k))
all_combos.append((id_comb, p))
return all_combos
ids = (1, 2, 3, 4, 5)
result = asset_allocation(ids, 5)
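To see what partitions produces, a small check (the n=5, k=2 case, verified by hand):
print(list(partitions(5, 2)))    # [(1, 4), (2, 3), (3, 2), (4, 1)]
# partitions(100, 2) likewise yields (1, 99), (2, 98), ..., (99, 1)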
This finishes much faster, takes up less space, and also allows you to home in on all the combinations for singles, pairs, etc., individually. Now, if you were to take the product of id_comb and p to generate the output in your question, you'd lose all that time saved. In fact, it'd come out a bit slower than the general method above, but at least this piece of code is still more compact.
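For completeness, expanding the compact result into the question's format would look roughly like this; it is my own sketch, and exactly the step the paragraph above warns against doing unless you need it:
expanded = []
for id_comb, p in result:                # one (id_comb, p) pair per group size k
    for ids, parts in it.product(id_comb, p):
        expanded.append([{'id': ID, 'ratio': part / 100}
                         for ID, part in zip(ids, parts)])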
Upvotes: 0
Reputation: 475
You could use the code below. The advantage of using it is that it won't consider cases with repeating ids:
import itertools
from math import isclose
def powerset(iterable):
"powerset([1,2,3]) --> () (1,) (2,) (3,) (1,2) (1,3) (2,3) (1,2,3)"
s = list(iterable)
return itertools.chain.from_iterable(itertools.combinations(s, r) for r in range(len(s)+1))
def combosSumToOneById(inDictArray):
results = []
uniqueIds = {d['id'] for d in inDictArray}
valuesDict = {id:[d['ratio'] for d in inDictArray if d['id']==id] for id in uniqueIds}
for idCombo in powerset(uniqueIds):
for valueCombo in itertools.product(*[v for k,v in valuesDict.items() if k in idCombo]):
if isclose(sum(valueCombo), 1.0):
results.append([{'id':xid, 'ratio': xv} for xid, xv in zip(idCombo, valueCombo)])
return results
I tested it on the below input
_array_list = [
{
'id': '1',
'ratio': .1
},
{
'id': '1',
'ratio': .2
},
{
'id': '2',
'ratio': .9
},
{
'id': '2',
'ratio': .8
},
{
'id': '3',
'ratio': .8
}]
combosSumToOneById(_array_list)
Returns: [[{'id': '1', 'ratio': 0.1}, {'id': '2', 'ratio': 0.9}], [{'id': '1', 'ratio': 0.2}, {'id': '2', 'ratio': 0.8}], [{'id': '1', 'ratio': 0.2}, {'id': '3', 'ratio': 0.8}]]
You should test whether the performance really exceeds that of the previous approach.
Please note that I modified the code to check isclose(sum, 1.0) rather than sum == 1. Since we are summing floating-point values, there will most likely be some error from the representation of the numbers, which is why this condition seems more appropriate.
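A quick illustration of why the tolerance check matters (standard floating-point behaviour, independent of this particular data):
from math import isclose

print(0.1 + 0.2 == 0.3)          # False: 0.1 + 0.2 evaluates to 0.30000000000000004
print(isclose(0.1 + 0.2, 0.3))   # True with the default relative tolerance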
Upvotes: 1
Reputation: 5954
Until someone who understands the algorithm better than I do comes along, I don't think there's any way of speeding that up with the data types you have.
With the algorithm you are currently using, you could split the work of checking combinations into chunks and then use a multiprocessing.Pool() to spread those chunks across all your CPUs at once. Since this is CPU-bound, that will get you a reasonable speed-up.
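A minimal sketch of that idea, assuming the question's flat _array_list; check_combo and all_valid_combos are illustrative names, not an established API:
import itertools
from multiprocessing import Pool

def check_combo(combo):
    # return the combination if it has no duplicate ids and its ratios sum to 1.0
    ids = [c['id'] for c in combo]
    if len(ids) != len(set(ids)):
        return None
    if abs(sum(c['ratio'] for c in combo) - 1.0) < 1e-9:
        return list(combo)
    return None

def all_valid_combos(array_list, max_size):
    candidates = itertools.chain.from_iterable(
        itertools.combinations(array_list, i) for i in range(1, max_size + 1))
    with Pool() as pool:
        # a generous chunksize keeps the per-combination inter-process overhead low
        return [c for c in pool.imap_unordered(check_combo, candidates, chunksize=10000)
                if c is not None]

if __name__ == '__main__':
    all_combos = all_valid_combos(_array_list, 2)   # _array_list as built in the question
Note that this still enumerates every candidate combination; it only spreads the checking over several processes.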
But I'm sure there is a more efficient way of doing this. You haven't said how you're getting your data, but if you can represent it in an array it might be vectorisable, which would be orders of magnitude faster.
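For instance, a rough sketch of what vectorising just the pair case could look like with NumPy (my own illustration; ratios are scaled to integer percent to sidestep float comparisons):
import numpy as np

r1 = np.arange(1, 101)   # ratios of the first id as integer percent (0.01 .. 1.00)
r2 = np.arange(1, 101)   # ratios of the second id
i, j = np.nonzero(r1[:, None] + r2[None, :] == 100)    # index pairs whose ratios sum to 1.0
pairs = np.column_stack((r1[i], r2[j])) / 100           # e.g. [[0.01, 0.99], [0.02, 0.98], ...]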
Upvotes: 0