Matt Yoon

Reputation: 446

What would be the best k for this kmeans clustering? (Elbow point plot)

I am trying kmeans to find the optimal place to start a coffee shop near subway station in Seoul.

Included features are:

  1. Total monthly alights on a particular station
  2. Rental Fees near a particular station
  3. Number of existing coffee shops near a particular station

I decided to use elbow point to find the best k. I did standardize all the features before running kmeans.
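The procedure described (standardize, run k-means for a range of k, plot SSE) can be sketched as follows. Note that the real station features aren't in the post, so this uses made-up placeholder data of the same shape (three features per station):

```python
# Sketch of the elbow procedure: standardize features, fit KMeans for a
# range of k, and record the SSE (sklearn's inertia_) for each k.
# The data here is a random placeholder for the real station features.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))  # placeholder: alights, rent, shop count

X_scaled = StandardScaler().fit_transform(X)

sse = {}
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_scaled)
    sse[k] = km.inertia_  # sum of squared distances to closest centroid

# Plot k vs. sse[k] and look for a bend ("elbow") in the curve.
for k, v in sse.items():
    print(k, round(v, 1))
```

Plotting `sse` against `k` gives the elbow plot; the absolute SSE values depend on the number of points and features, so only the shape of the curve matters, not its height.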

[Elbow plot: SSE vs. number of clusters k]

Now the elbow point seems to be k = 3 (or maybe k = 2), but I think the SSE is too high for an elbow point.

Also using k=3, it was difficult to gain insights from the clusters because there were only three of them.

Using k=5 was the sweet spot to gain insights.

Can using k=5 be justified even if it's not an elbow point?

Or is kmeans not a good option in the first place?

Upvotes: 4

Views: 4384

Answers (3)

Has QUIT--Anony-Mousse

Reputation: 77454

I don't think k-means on such features solves your problem. You probably need to rethink your approach. In particular, pay attention to what function you optimize (what does SSE mean for your task?) - using the wrong function on the wrong features can mean you get the answer to a different question...

The elbow method is horribly unreliable, and I wish people would finally stop mentioning it. If you use it, the first question you should ask is: does the curve look like a typical curve on random data, where there is no meaningful k? If so, stop completely and rethink your approach, because it looks like your data is bad - or at least that k-means does not work on it. You are exactly in this situation: the plot suggests that k-means does not work on your data.

Upvotes: 5

chitown88

Reputation: 28565

One way to choose the number of clusters is the ‘elbow method’. As explained by machine learning expert Andrew Ng, by calculating the distortion value for each number of clusters k, you can plot that value against the number of clusters. A suitable k value may be identified where the distortion value begins to decrease at a lower rate, which in Ng’s example in the figure below occurs at k = 3 (Ng, no date a). The problem arises when the distortion value decreases at a steady rate, creating a smooth curve, as Ng exemplifies on the right in that figure. There is no distinct ‘joint’ to identify an ‘elbow’.

[Figure: Ng's elbow-method examples, a clear elbow at k = 3 on the left and a smooth curve with no distinct elbow on the right]

When I was working on my dissertation, my data fell into the latter case (see below: what should I choose for k? It ended up being 4 when doing silhouette analysis),

[Elbow plot from the dissertation data: a smooth curve with no clear elbow]

which meant I needed to find an alternative method. That alternative method was silhouette analysis. As explained in the scikit-learn documentation, silhouette analysis can be used to study the separation between the resulting clusters.

A silhouette coefficient is computed for each sample and ranges from -1 to +1. A score near +1 indicates that the sample is distant from neighbouring clusters, and thus clearly belongs to its assigned cluster. A score of zero implies that the sample is on or very close to the decision boundary between two clusters. A score near -1 signifies that the sample may have been assigned to the wrong cluster (Selecting the number of clusters with silhouette analysis on KMeans clustering — scikit-learn 0.19.1 documentation, 2017). By visualizing the distribution of observations across the clusters and their silhouette coefficient values, one can, like with the ‘elbow method’, visually identify an appropriate k value. The intention is to select a k value for which the number of samples within each cluster is roughly the same, while the majority of samples maintain above-average silhouette scores.
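As a minimal sketch of the silhouette approach (using synthetic blob data in place of the real station features, which aren't available here), you can compare the average silhouette score across candidate k values with `sklearn.metrics.silhouette_score`:

```python
# Pick k by maximizing the average silhouette score over candidate values.
# Synthetic blob data stands in for the real standardized features.
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

scores = {}
for k in range(2, 8):  # silhouette is only defined for k >= 2
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_scaled)
    scores[k] = silhouette_score(X_scaled, labels)  # mean over all samples

best_k = max(scores, key=scores.get)
print(best_k, {k: round(v, 3) for k, v in scores.items()})
```

The scikit-learn silhouette example goes further and plots the per-sample coefficients per cluster, which is what lets you check the "roughly equal cluster sizes, mostly above-average scores" criterion described above.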

I'd suggest giving that a try (even if there is a clear "elbow") to a) verify that you're choosing an appropriate k value, and b) practice and see/understand alternative methods.

Upvotes: 4

bglbrt

Reputation: 2098

The elbow point is not a definitive rule but more of a heuristic (it works most of the time but not always, so I see it as a good rule of thumb for choosing a number of clusters to start from). On top of that, the elbow point cannot always be unambiguously identified, so you shouldn't worry too much about it.

So in that case, if you get better results or a better understanding of your data using k=5, then I would highly suggest using k=5 rather than k=3!

Now, for your other question: there may be approaches that suit your data better, but that doesn't mean k-means isn't a good way to start. If you want to try other things, the scikit-learn clustering documentation provides good guidance on which algorithm or method to use.
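As a hedged sketch of trying alternatives from scikit-learn's clustering module on the same kind of standardized features (placeholder random data here, since the real features aren't in the post), two options that don't rely on an elbow plot are DBSCAN and agglomerative clustering:

```python
# Two alternatives to KMeans from sklearn.cluster, run on placeholder data.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN, AgglomerativeClustering

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))  # placeholder for the 3 station features
X_scaled = StandardScaler().fit_transform(X)

# DBSCAN finds density-based clusters and labels outliers as -1; it needs
# no k, which could help flag "unusual" stations instead. eps/min_samples
# here are arbitrary starting values, not tuned.
db_labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X_scaled)

# Agglomerative clustering builds a hierarchy; inspecting a dendrogram can
# guide where to cut, instead of reading an elbow plot.
agg_labels = AgglomerativeClustering(n_clusters=5).fit_predict(X_scaled)

print(sorted(set(db_labels)), sorted(set(agg_labels)))
```

Neither is automatically better for this task; the point is that the choice of algorithm should follow from what "a good place for a coffee shop" means, as the first answer argues.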

Upvotes: 6
