Joud

Reputation: 11

K-Means taking a long time

I'm using k-means for my project for the first time. My dataset has more than 400,000 rows and 11 columns, and I run k-means for k = 3, 5, 7, 9, and 10. It has been running for more than 65 minutes and there's still no output. Is that normal? It's my first time, so I'm not sure what to expect.

I'm using Python in Visual Studio.

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt

sse = []
silhouette_scores = []
k_values = [3, 5, 7, 9]

for k in k_values:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=1, init='k-means++')
    kmeans.fit(x_pca)
    sse.append(kmeans.inertia_)
    silhouette_scores.append(silhouette_score(x_pca, kmeans.labels_))

# the elbow method
plt.figure(figsize=(10, 6))
plt.plot(k_values, sse, marker='o')
plt.title('Elbow Method for Optimal k')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Sum of Squared Errors (SSE)')
plt.show()

# silhouette scores
plt.figure(figsize=(10, 6))
plt.plot(k_values, silhouette_scores, marker='o')
plt.title('Silhouette Score for Optimal k')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Silhouette Score')
plt.show()

Upvotes: 1

Views: 71

Answers (1)

Teemu Risikko

Reputation: 3275

Analysis

It's not your K-means that is slow, it's silhouette_score.

The time complexity of K-means is

O(n × K × I × d)

- n = number of points
- K = number of clusters
- I = number of iterations
- d = number of attributes

The hardest of these to estimate is I, but it's safe to assume it's in the ballpark of ~10-50 iterations, with a maximum of 300 (max_iter) in sklearn's default settings.

So for your dataset of 400,000 points, 3-9 clusters, 10-300 iterations and 11 attributes, the worst case is 400,000 × 9 × 300 × 11 = 11,880,000,000 operations, or about 1.2 × 10^10. In practical terms, this took me 0.4 seconds for an example dataframe of your dimensions with k=3.
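If you want to sanity-check those numbers on your own machine, here is a minimal sketch (assuming a random array standing in for your x_pca, so the exact timing and iteration count will differ) that times a single fit and prints how many iterations K-means actually needed:

import time

import numpy as np
from sklearn.cluster import KMeans

# Random stand-in with the same shape as the data in the question (400,000 x 11).
rng = np.random.default_rng(42)
X = rng.standard_normal((400_000, 11))

start = time.perf_counter()
kmeans = KMeans(n_clusters=3, random_state=42, n_init=1, init='k-means++').fit(X)
elapsed = time.perf_counter() - start

# n_iter_ is the I in the formula above; inertia_ is the SSE.
print(f"fit took {elapsed:.2f} s, converged in {kmeans.n_iter_} iterations")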

However, the complexity of silhouette_score is O(n²), so in your case 400,000² = 160,000,000,000 or 1.6 × 10^11. That's more than a tenfold increase over even the worst case of the K-means calculation, and in practice the gap is far bigger: it took me 30 minutes to run even one silhouette pass with k=3, versus 0.4 seconds for the fit.

What to do?

So for your question of "is that normal?", the answer is yes.

It's possible to calculate the distance matrix beforehand with sklearn's pairwise_distances and then pass it to silhouette_score(precomputed_distance_matrix_of_x_pca, kmeans.labels_, metric="precomputed"), so that the distances don't have to be recomputed for every k (see the sketch further below). However, it's unlikely to give any significant speedup, and you might easily run into this:

numpy.core._exceptions._ArrayMemoryError: Unable to allocate 1.16 TiB for an array with shape (400000, 400000) and data type float64

Oopsie.
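For reference, the precomputed variant would look roughly like this. The sketch below assumes x_pca is a NumPy array and subsamples it first (the 10,000 points are my assumption, not something from the question), because the full 400,000 × 400,000 matrix is exactly what triggers the allocation error above:

import numpy as np
from sklearn.metrics import pairwise_distances, silhouette_score

# Subsample so the distance matrix stays manageable (~0.8 GB at 10,000 points);
# the full 400,000-point matrix would need the ~1.16 TiB mentioned above.
idx = np.random.default_rng(0).choice(len(x_pca), size=10_000, replace=False)
x_sub = x_pca[idx]
labels_sub = kmeans.labels_[idx]

dist = pairwise_distances(x_sub)
score = silhouette_score(dist, labels_sub, metric="precomputed")
print(score)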

Another option mentioned elsewhere is to add n_jobs=-1 to the parameters of silhouette_score, which uses all CPU cores for the distance calculation. This might marginally improve the speed, but for me a second run with this took 60 minutes with k=3 (so double the time), possibly because I was running the code inside a devcontainer.
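That change is just an extra keyword argument; silhouette_score forwards unknown keyword arguments to the underlying pairwise-distance computation, so something like this is all it takes:

# Use all CPU cores for the pairwise-distance part of the silhouette calculation.
score = silhouette_score(x_pca, kmeans.labels_, n_jobs=-1)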

Your best option is probably to look for an alternative approach to selecting the number of clusters, or just let the code run as long as it needs. At the very least, consider adding prints between loop iterations so you can see the progress.
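If you go the "let it run" route, a lightly modified version of your loop (same logic, just with timing prints added) at least shows which k it is currently on and how long each step takes:

import time

for k in k_values:
    t0 = time.perf_counter()
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=1, init='k-means++')
    kmeans.fit(x_pca)
    print(f"k={k}: fit finished in {time.perf_counter() - t0:.1f} s")

    t0 = time.perf_counter()
    sse.append(kmeans.inertia_)
    silhouette_scores.append(silhouette_score(x_pca, kmeans.labels_))
    print(f"k={k}: SSE and silhouette finished in {time.perf_counter() - t0:.1f} s")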

Upvotes: 2
