sorpha
sorpha

Reputation: 433

Find a range that generates a set that is the closest to the "correct" set

I'm trying to design an algorithm that can find a range of numbers that best fits a defined correct set of numbers in a super set. I've included an image below that should hopefully clarify exactly what I mean.

Correct Set

One ideal solution to the above problem would be the range [2-5], which would yield the following result:

enter image description here

I don't really have a defined performance metric for the algorithm but I suppose this would be a good start:

enter image description here ^^ The above should be a '+' not a minus.

Brute forcing would not be ideal since these sets could contain thousands of numbers. My current idea is just averaging the correct set and adding/subtracting a standard deviation but there must be a better way.

Upvotes: 2

Views: 108

Answers (2)

Nuclearman
Nuclearman

Reputation: 5304

I suspect cosine similarity might be a better metric (if more expensive to calculate) for comparing the two multisets (counters). Though, you'll need to adjust the values for A[i] and B[i].

In the case that A is the correct multiset and B is the range multiset, then are a few cases that need to be considered, most don't appear in your example so not sure how relevant those cases are.

  1. B[i] == A[i], A[i] != 0: The exact case. The value in the range occurs exactly the same amount as in the correct case. This happens with i=4 and i=5 in your example.
  2. B[i] is slightly more than A[i]: This occurs with the other values in the multiset. In this case there should be slight penalty.
  3. B[i] is more than A[i]: For example B[i] might be 100 and A[i] might be 30. In this case, there should be a moderate to large size penalty.
  4. B[i] is much more than A[i]: For example B[i] might be 1000 and A[i] might be 5. In this case, there should be a moderate to large size penalty.
  5. B[i] == A[i], A[i] == 0: This case is difficult to determine. i is within the range, but doesn't exist in A or B. Likely the penalty will be some constant.

Towards the end, it seems like it's best to use B' and A' instead of B and A for some k >= 1 and c >= 1. Calculated based on the following Pythonic pseudocode.

Version 1:
    A' = 0
    if A[i] == B[i]:
        if A[i] == 0:
            B'[i] = 0
        else:
            B'[i] = c
    else:
        B'[i] = (A[i] - B[i])^k

Seems like k = 1 or k = 2 would work. While c could be calculated based on the size of the range.

If the fact that B[i] is not equal to A[i] makes the difference irreverent the formula simplifies to simply:

Version 2
    A' = 0
    if A[i] == B[i]:
        if A[i] == 0:
            B' = 0
        else:
            B'[i] = c
    else:
        B'[i] = b

Where b and c are simply some constants and c > b. I'm thinking Version 2 reduces to the Jaccard index in either the case where c = 1 and b = 1 or c = 0 and b = 1. The issue with the Jaccard index is that if you have say A[5] == B[5] == 1 and A[10000] == B[10000] == 1 as the only two elements. The Jaccard index for the range [5,10000] is 1. This might not actually be a problem, but it's something to consider.

Upvotes: 1

j_random_hacker
j_random_hacker

Reputation: 51226

Whatever metric you eventually decide on, if it takes O(n) time to calculate the metric on a range of length n then this can be solved to optimality in O(n^3) time: all you need to do is sort the numbers (O(n log n)) and then for each of the O(n^2) possible (start number, end number) combinations from this range, calculate your metric in O(n) time per combination.

In fact most metrics can be computed incrementally in constant time, so that you can probably get O(n^2) overall without much trouble. E.g. your stated metric (which, BTW, I agree is probably not the best as 0 can easily appear in the denominator) can be incrementally calculated very easily: for a range (a, b), record the count of correct guesses and the count of incorrect guesses; calculating your metric from these two numbers is simply a subtraction and a division. To then compute the answer for range (a, b+1), simply increment whichever one of these two totals is appropriate.

What metric would I recommend? The Jaccard index, which is always between 0 and 1, and in particular, always defined, provided that at least one of the two sets you are comparing (here, the numbers in the range, and the numbers in your "defined correct set") has more than 0 elements.

[EDIT: Provided that your metric has the totally reasonable property that making the range wider than necessary never makes it better, you can do much better than trying all pairs of numbers in the input: you only need to try all pairs of numbers in the correct set. If the correct set is much smaller than the total input, this will be a big win. ]

Upvotes: 1

Related Questions