user9562553

What does precompute_distances do in sklearn's KMeans?

I am trying to understand the purpose of the precompute_distances parameter:

    class sklearn.cluster.KMeans(n_clusters=8, init='k-means++', n_init=10,
    max_iter=300, tol=0.0001, precompute_distances='auto', verbose=0,
    random_state=None, copy_x=True, n_jobs=1, algorithm='auto')

Which distances does it precompute?

Upvotes: 2

Views: 3000

Answers (1)

Bert Kellerman

Reputation: 1629

In each k-means iteration, every sample has to be assigned to its closest cluster center (the labeling step). If precompute_distances is True, this is done via metrics.pairwise_distances_argmin_min(). If precompute_distances is False, it is done via cluster._k_means._assign_labels_array().

https://github.com/scikit-learn/scikit-learn/blob/a24c8b464d094d2c468a16ea9f8bf8d42d949f84/sklearn/cluster/k_means_.py#L618

The first method uses matrix operations, while the second computes the distances one sample-center pair at a time. That's why precompute_distances=True is generally faster but uses more memory.
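To make the trade-off concrete, here is a rough NumPy sketch of the two labeling strategies. It is not scikit-learn's actual code: the function names assign_labels_vectorized and assign_labels_loop are made up for illustration, and the data is random.

```python
import numpy as np


def assign_labels_vectorized(X, centers):
    """Hypothetical helper: label samples using one full distance matrix."""
    # Build the whole (n_samples, n_clusters) squared-distance matrix at once,
    # using ||x - c||^2 = ||x||^2 - 2 x.c + ||c||^2. Fast, but needs
    # n_samples * n_clusters floats of extra memory.
    x_sq = np.einsum("ij,ij->i", X, X)[:, np.newaxis]
    c_sq = np.einsum("ij,ij->i", centers, centers)[np.newaxis, :]
    d2 = x_sq - 2.0 * (X @ centers.T) + c_sq
    return d2.argmin(axis=1)


def assign_labels_loop(X, centers):
    """Hypothetical helper: label samples one sample-center pair at a time."""
    labels = np.empty(X.shape[0], dtype=np.intp)
    for i, x in enumerate(X):
        best_k, best_d = 0, np.inf
        for k, c in enumerate(centers):
            d = np.sum((x - c) ** 2)  # a single pairwise squared distance
            if d < best_d:
                best_k, best_d = k, d
        labels[i] = best_k  # only the running minimum is kept in memory
    return labels


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))        # illustrative random data
centers = rng.normal(size=(4, 5))    # illustrative random centers

labels_fast = assign_labels_vectorized(X, centers)
labels_slow = assign_labels_loop(X, centers)
```

Both functions should return the same labels; the difference is only in how much intermediate memory they allocate and how much of the work is done by vectorized matrix operations.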

These minimum distances cannot be cached between iterations because the k-means centers change at every iteration.
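A small NumPy sketch of one labeling/update step shows why the distances go stale. This is only an illustration on random data, with a made-up helper sq_dists, not scikit-learn's implementation.

```python
import numpy as np


def sq_dists(X, centers):
    """Hypothetical helper: (n_samples, n_clusters) squared distances."""
    return ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)


rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))   # illustrative random data
centers = X[:3].copy()          # start from three of the samples

# Labeling step: distances to the current centers.
d_before = sq_dists(X, centers)
labels = d_before.argmin(axis=1)

# Update step: each center moves to the mean of its assigned samples.
new_centers = np.array([X[labels == k].mean(axis=0) for k in range(3)])

# The distances computed before the update no longer match the new centers,
# so they have to be recomputed in the next iteration.
d_after = sq_dists(X, new_centers)
changed = not np.allclose(d_before, d_after)
```

Here changed ends up True: once the centers move, the previous distance matrix is no longer valid.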

Upvotes: 6
