Reputation: 107
I have read a couple of different articles on how PAM selects the initial medoids but I am getting conflicting views.
Some propose that the k first medoids are selected randomly, while others suggest that the algorithm selects initially the k representative medoids in the dataset (not clarifying how that "representativeness" happens though). Below I have listed these resources:
Drawbacks of K-Medoid (PAM) Algorithm
https://paginas.fe.up.pt/~ec/files_1112/week_06_Clustering_part_II.pdf
https://www.datanovia.com/en/lessons/k-medoids-in-r-algorithm-and-practical-examples/
1) My question would be if someone could explain in more detail how the algorithm selects the initial k medoids as from what I understand different initial selection can lead to different results.
2) Also is that one of the reasons of using CLARA (apart from minimizing computing time and RAM storage problem) - that is to find medoids through resampling that are the "optimal" options?
I am using R as a parenthesis, with the function pam(). Open to other functions in other libraries if there is a better alternative I am not aware of.
Upvotes: 1
Views: 420
Reputation: 77474
Read the original sources.
There is a lot of nonsense written later, unfortunately.
PAM consists of two algorithms:
The k-means style algorithm works much worse than PAM. Any description of PAM that doesn't mention these two parts is inaccurate (and there are quite some of these...)
The R package seems to use the real PAM algorithm:
By default, when medoids are not specified, the algorithm first looks for a good initial set of medoids (this is called the build phase). Then it finds a local minimum for the objective function, that is, a solution such that there is no single switch of an observation with a medoid that will decrease the objective (this is called the swap phase)
CLARA clearly will find worse solutions than PAM, as it runs PAM on a sample, and I'd the optimum medoids are not in the sample, then they cannot be found.
Upvotes: 2