Baz
Baz

Reputation: 13135

Establishing highest score for a set of combinations

I'm programming in python.

I have the data of the following form:

(A, B, C, D, E, F, G, H, I)

Segments of this data are associated with a score, for example:

scores:

    (A, B, C, D) = .99
    (A, B, C, E) = .77
    (A, B, E) = .66
    (G,) = 1
    (I,) = .03
    (H, I) = .55
    (I, H) = .15
    (E, F, G) = .79
    (B,) = .93
    (A, C) = .46
    (D,) = .23
    (D, F, G) = .6
    (F, G, H) = .34
    (H,) = .09
    (Y, Z) = 1

We can get a score for this data as follows:

A B C E + D F G + H I = .77 * .6 * .55 = 0.2541

another possiblity is:

A B C D + E F G + H + I = .99 * .79 * .09 * .03 = 0.00211167

So, the first combination gives the higher score.

I wish to write an algorithm to establish for the data above the highest possible score. The members of data should no be repeated more than once. In other words:

A B C E + E F G + D + H I 

is not valid. How would you recommend I go about solving this?

Thanks,

Barry

EDIT: I should clarify that (H, I) != (I, H) and that (I, H) is not a subsegment for ABCDEFGHI, but is a subsegment of ABIHJ. Another thing I should mention is that scores is a very large set (millions) and the segment on which we're calculating the score has an average length of around 10. Furthermore, the way in which I calculate the score might change in the future. Maybe I'd like to add the subsegments and take an average instead of multipling, who knows... for that reason it might be better to seperate the code which calculates the possible combinations from the actual calculation of the score. At the moment, I'm inclined to think that itertools.combinations might offer a good starting point.

Upvotes: 5

Views: 1443

Answers (4)

Falk Hüffner
Falk Hüffner

Reputation: 5040

First, you could take the logarithm of each score, since then the problem is to maximize the sum of the scores instead of the product. Then, you can solve the problem as an Assignment Problem, where to each data point you assign one sequence.

Upvotes: 0

dstromberg
dstromberg

Reputation: 7177

First, I'd suggest assigning a unique symbol to the segments that make sense.

Then you probably want combinations of those symbols (or perhaps permutations, I'm sure you know your problem better than I do), along with a "legal_segment_combination" function you'd use to throw out bad possibilities - based on a matrix of which ones conflict and which don't.

>>> import itertools
>>> itertools.combinations([1,2,3,4], 2)
<itertools.combinations object at 0x7fbac9c709f0>
>>> list(itertools.combinations([1,2,3,4], 2))
[(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
>>>

Then max the valid possibilities that make it past legal_segment_combination().

Upvotes: 1

Nicolas78
Nicolas78

Reputation: 5144

This sounds like a NP-complete problem in disguise, a derivative of the Knapsack problem. This means you may have to walk through all possibilities to get an exact solution.

Even though... wait. Your values are between 0 and 1. That is the results can only get smaller of at most stay equal. Therefore the solution is trivial: Get the single group with the highest value, and be done with. (I'm aware that's probably not what you want, but you might have to add another condition, e.g. all elements have to be used..?)

A beginning of a brute force approach:

import operator

segment_scores = {(A, B, C, D): .99, (A, B, C, E): .77} #...

def isvalid(segments):
    """returns True if there are no duplicates
    for i in range(len(segments)-1):
        for element in segments[i]:
            for j in range(len(segments)-i-1):
              othersegment = segments[j+i+1]
              if element in othersegment:
                return False
    return True

    better way:
    """
    flattened = [item for sublist in segments for item in sublist]
    # http://stackoverflow.com/questions/952914/making-a-flat-list-out-of-list-of-lists-in-python
    return len(set(flattened)) == len(flattened)

def getscore(segments):
    """
    p = 1.0
    for segment in segments:
      p *= segment_scores[segment]
    return p

    better way:
    """
    return reduce(operator.mul, [segment_scores[segment] for segment in segments])

Now, create all 2^(num segments) possible combinations of segments, check for each if it is valid, and if it is, compute the score while keeping the current winner and its highscore. Just a starting point...

OK just another update: There's lots of space for optimizations here, in particular since you're multiplying (I'm assuming now you have to use each element).

  • Since your total score never increases, you can drop any exploration path [segment0, segment1] that drops below the current high score because you'll only get works for any segment2.

  • If you don't just iterate over all possibilities but start by exploring all segment lists that contain the first segment (by recursively exploring all segment lists that contain in addition the second segment and so on), you can break as soon as, for example, the first and the second segment are invalid, i.e. no need to explore all possibilities of grouping (A,B,C,D) and (A,B,C,D,E)

  • Since multiplying hurts, trying to minimize the number of segments might be a suitable heuristic, so start with big segments with high scores.

Upvotes: 2

Karl Knechtel
Karl Knechtel

Reputation: 61527

Brute-forcing, by using recursion (for each segment in order, we recursively find the best score using the segment, and the best score not using the segment. A score of 0 is assigned if there is no possible combination of segments for the remaining items):

segment_scores = (('A', 'B', 'C', 'D'), .99), (('A', 'B', 'C', 'E'), .77) #, ...

def best_score_for(items, segments, subtotal = 1.0):
    if not items: return subtotal
    if not segments: return 0.0
    segment, score = segments[0]
    best_without = best_score_for(items, segments[1:], subtotal)
    return max(
        best_score_for(items.difference(segment), segments[1:], subtotal * score),
        best_without
    ) if items.issuperset(segment) else best_without

best_score_for(set('ABCDEFGHI'), segment_scores) # .430155

Upvotes: 2

Related Questions