Reputation: 4353
I am trying to do some clustering analysis on a dataset. I am using a number of different approaches to estimate the number of clusters, then I put what every approach gives (number of clusters) in a list, like so:
total_pred = [0, 0, 1, 1, 0, 1, 1]
Now I want to estimate the real number of clusters, so I let the methods above vote, for example, above, more models found 1 cluster than 0, so I take 1 as the real number of clusters.
I do this by:
counts = np.bincount(np.array(total_pred))
real_nr_of_clusters = np.argmax(counts))
There is a problem with this method, however. If the above list contains something like:
[2, 0, 1, 0, 1, 0, 1, 0, 1]
I will get 0 clusters as the average, since 0 is repeated more often. However, if one model found 2 clusters, it's safe to assume it considers at least 1 cluster is there, hence the real number would be 1.
How can I do this by modifying the above snippet?
To make the problem clear, here are a few more examples:
[1, 1, 1, 0, 0, 0, 3]
should return 1,
[0, 0, 0, 1, 1, 3, 4]
should also return 1 (since most of them agree there is AT LEAST 1 cluster).
Upvotes: 1
Views: 77
Reputation: 22314
Here is an implementation of the described algorithm.
l = [2, 0, 1, 0, 1, 0, 1, 0, 1]
l = sorted(l, reverse=True)
votes = {x: i for i, x in enumerate(l, start=1)}
{2: 1, 1: 5, 0: 9}
Notice that since you define a vote as agreeing with anything smaller than itself, then min(l)
will always win, because everyone will agree that there are at least min(l)
clusters. In this case min(l) == 0
.
Beforehand, notice that taking the mean or the median are valid and light-weight options that both satisfy the desired output on your examples.
Although, taking the mean might not be what you want if, for say, you encounter votes with high variance such as [0, 0, 7, 8, 10]
where it is unlikely that the answer is 5
.
A more general way to fix that is to include a voter's bias toward votes close to theirs. Surely that a 2-voter will agree more to a 1
than a 0
.
You do that by implementing a metric (note: this is not a metric in the mathematical sense) that determines how much an instance that voted for x
is willing to agree to a vote for y
on a scale of 0
to 1
.
Note that this approach will allow voters to agree on a number that is not on the list.
We need to update our code to account for applying that pseudometric.
def d(x, y):
return x <= y
l = [2, 0, 1, 0, 1, 0, 1, 0, 1]
votes = {y: sum(d(x, y) for x in l) for y in range(min(l), max(l) + 1)}
{0: 9, 1: 5, 2: 1}
The above metric is a sanity check. It is the one your provided in your question and it indeed ends up determining that 0
wins.
You will have to toy a bit with your metrics, but here are a few which may make sense.
def d(x, y):
return 1 / (1 + abs(x - y))
l = [2, 0, 1, 0, 1, 0, 1, 0, 1]
votes = {y: sum(d(x, y) for x in l) for y in range(min(l), max(l) + 1)}
# {0: 6.33, 1: 6.5, 2: 4.33}
This one is a generalization of the previous. As n
grows, voters tend to agree less and less with distant vote casts.
def d(x, y, n=1):
return 1 / (1 + abs(x - y)) ** n
l = [2, 0, 1, 0, 1, 0, 1, 0, 1]
votes = {y: sum(d(x, y, n=2) for x in l) for y in range(min(l), max(l) + 1)}
# {0: 5.11, 1: 5.25, 2: 2.44}
Similar to the previous metric, this one is close to what you described at first in the sense that a voter will never agree to a vote higher than theirs.
def d(x, y, n=1):
return 1 / (1 + abs(x - y)) ** n if x >= y else 0
l = [2, 0, 1, 0, 1, 0, 1, 0, 1]
votes = {y: sum(d(x, y, n=2) for x in l) for y in range(min(l), max(l) + 1)}
# {0: 5.11, 1: 4.25, 2: 1.0}
An other option that would be sensical is a normal distribution or a skewed normal distribution.
Upvotes: 1
Reputation: 2402
While the other answer provides a comprehensive review of possible metrics and methods, it seems what you are seeking is to find the closest number of clusters to mean! So something as simple as:
cluster_num=int(np.round(np.mean(total_pred)))
Which returns 1 for all your cases as you expect.
Upvotes: 0