Qubix

Reputation: 4353

Count number of repeated elements in a list, considering the ones larger than them

I am trying to do some clustering analysis on a dataset. I am using a number of different approaches to estimate the number of clusters, then I put each approach's estimate (number of clusters) in a list, like so:

total_pred = [0, 0, 1, 1, 0, 1, 1]

Now I want to estimate the real number of clusters, so I let the methods above vote. For example, above, more models found 1 cluster than 0, so I take 1 as the real number of clusters.

I do this by:

    import numpy as np

    counts = np.bincount(np.array(total_pred))
    real_nr_of_clusters = np.argmax(counts)

There is a problem with this method, however. If the above list contains something like:

[2, 0, 1, 0, 1, 0, 1, 0, 1]

I will get 0 clusters as the result, since 0 is tied for the most occurrences and np.argmax picks the first maximum. However, if one model found 2 clusters, it's safe to assume it considers that at least 1 cluster is there, hence the real number should be 1.
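For example, with the list above, the snippet picks 0:

    counts = np.bincount(np.array([2, 0, 1, 0, 1, 0, 1, 0, 1]))  # array([4, 4, 1])
    print(np.argmax(counts))  # prints 0, but I want 1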

How can I do this by modifying the above snippet?

To make the problem clear, here are a few more examples:

[1, 1, 1, 0, 0, 0, 3] 

should return 1,

[0, 0, 0, 1, 1, 3, 4]

should also return 1 (since most of them agree there is AT LEAST 1 cluster).

Upvotes: 1

Views: 77

Answers (2)

Olivier Melançon

Reputation: 22314

There is a problem with your logic

Here is an implementation of the described algorithm.

l = [2, 0, 1, 0, 1, 0, 1, 0, 1]

# Sort in decreasing order so that a value's last position
# equals the number of votes greater than or equal to it
l = sorted(l, reverse=True)

# For duplicate keys the comprehension keeps the last index seen,
# i.e. the count of voters who cast at least x
votes = {x: i for i, x in enumerate(l, start=1)}

Output

{2: 1, 1: 5, 0: 9}

Notice that since you define a vote as agreeing with anything smaller than itself, then min(l) will always win, because everyone will agree that there are at least min(l) clusters. In this case min(l) == 0.

How to fix it

Mean and median

First, notice that taking the mean or the median are valid, light-weight options that both produce the desired output on your examples.
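As a minimal sketch with numpy, rounding the mean to the nearest integer:

import numpy as np

l = [2, 0, 1, 0, 1, 0, 1, 0, 1]

mean_estimate = int(np.round(np.mean(l)))  # mean is 0.67, rounds to 1
median_estimate = int(np.median(l))        # median is 1.0, so 1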

Bias

However, taking the mean might not be what you want if, say, you encounter votes with high variance such as [0, 0, 7, 8, 10], where it is unlikely that the answer is 5.

A more general way to fix this is to model a voter's bias toward votes close to their own. Surely a voter who cast 2 will agree more with 1 than with 0.

You do that by implementing a metric (note: this is not a metric in the mathematical sense) that determines, on a scale of 0 to 1, how much an instance that voted for x is willing to agree with a vote for y.

Note that this approach will allow voters to agree on a number that is not on the list.

We need to update our code to account for applying that pseudometric.

def d(x, y):
    # A voter who cast x agrees with a vote for y only if y <= x
    return x >= y

l = [2, 0, 1, 0, 1, 0, 1, 0, 1]

# Tally, for every candidate y in the value range, how much each voter agrees
votes = {y: sum(d(x, y) for x in l) for y in range(min(l), max(l) + 1)}

Output

{0: 9, 1: 5, 2: 1}

The above metric is a sanity check. It is the one you provided in your question, and it indeed ends up determining that 0 wins.
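Whichever metric you choose, the final estimate is simply the candidate with the highest tally, which you can extract from the votes dict with a one-liner:

real_nr_of_clusters = max(votes, key=votes.get)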

Metric choices

You will have to toy a bit with your metrics, but here are a few which may make sense.

Inverse of the linear distance

def d(x, y):
    return 1 / (1 + abs(x - y))

l = [2, 0, 1, 0, 1, 0, 1, 0, 1]

votes = {y: sum(d(x, y) for x in l) for y in range(min(l), max(l) + 1)}
# {0: 6.33, 1: 6.5, 2: 4.33}

Inverse of the nth power of the distance

This one is a generalization of the previous. As n grows, voters tend to agree less and less with distant votes.

def d(x, y, n=1):
    return 1 / (1 + abs(x - y)) ** n

l = [2, 0, 1, 0, 1, 0, 1, 0, 1]

votes = {y: sum(d(x, y, n=2) for x in l) for y in range(min(l), max(l) + 1)}
# {0: 5.11, 1: 5.25, 2: 2.44}

Upper-bound distance

Similar to the previous metric, this one is close to what you described at first, in the sense that a voter will never agree with a vote higher than their own.

def d(x, y, n=1):
    return 1 / (1 + abs(x - y)) ** n if x >= y else 0

l = [2, 0, 1, 0, 1, 0, 1, 0, 1]

votes = {y: sum(d(x, y, n=2) for x in l) for y in range(min(l), max(l) + 1)}
# {0: 5.11, 1: 4.25, 2: 1.0}

Normal distribution

Another option that would make sense is a normal distribution or a skewed normal distribution.
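As a sketch of the symmetric case, using a Gaussian weight whose spread sigma is a hypothetical parameter you would have to tune:

import math

def d(x, y, sigma=1.0):
    # Agreement decays as a Gaussian bell curve centred on the voter's own cast
    return math.exp(-((x - y) ** 2) / (2 * sigma ** 2))

l = [2, 0, 1, 0, 1, 0, 1, 0, 1]

votes = {y: sum(d(x, y) for x in l) for y in range(min(l), max(l) + 1)}
# approximately {0: 6.56, 1: 7.03, 2: 3.97}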

Upvotes: 1

anishtain4

Reputation: 2402

While the other answer provides a comprehensive review of possible metrics and methods, it seems what you are seeking is the number of clusters closest to the mean. So something as simple as:

cluster_num = int(np.round(np.mean(total_pred)))

This returns 1 for all your cases, as you expect.
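A quick check against all the lists from the question:

import numpy as np

examples = [
    [0, 0, 1, 1, 0, 1, 1],
    [2, 0, 1, 0, 1, 0, 1, 0, 1],
    [1, 1, 1, 0, 0, 0, 3],
    [0, 0, 0, 1, 1, 3, 4],
]

for total_pred in examples:
    print(int(np.round(np.mean(total_pred))))  # prints 1 for each list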

Upvotes: 0
