How does APPLY_KMEANS work in Vertica

Question

I am testing the machine learning tools in Vertica. I understand how KMEANS works, since it just divides the data into clusters. However, I do not understand how APPLY_KMEANS works on new data. It looks to me like it acts more like a classification method, since it classifies new data into the existing clusters. So what algorithm is used (k-nearest neighbors)? It's not very clear from the documentation.

pltrdy · Accepted Answer

k-means is a clustering algorithm (not classification!) that iterates over 2 steps:

Assignment step: assign each point to its nearest centroid
Update step: move each centroid to the mean of the points assigned to it

When you build your k-means model, you first initialize the centroids (there are several strategies; random initialization is one of them), then you iterate over the two steps until the clustering is good enough (for example, until your error is below a given threshold).
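To make the training loop concrete, here is a minimal sketch in plain Python. It is not Vertica's implementation: the function names (`assign`, `update`, `kmeans`), the random initialization from the data points, the Euclidean distance, and the stopping rule (stop when no centroid moves more than `tol`) are all assumptions chosen for illustration.

```python
import math
import random

def assign(points, centroids):
    """Assignment step: index of the nearest centroid for each point."""
    return [
        min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
        for p in points
    ]

def update(points, labels, centroids):
    """Update step: move each centroid to the mean of its assigned points."""
    new_centroids = []
    for j, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == j]
        if not members:
            # Assumption: an empty cluster keeps its previous centroid.
            new_centroids.append(old)
            continue
        new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
    return new_centroids

def kmeans(points, k, max_iter=100, tol=1e-9, seed=0):
    """Return k centroids; these centroids are the whole 'model'."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random initialization (one strategy among several)
    for _ in range(max_iter):
        labels = assign(points, centroids)
        new_centroids = update(points, labels, centroids)
        # Largest distance any centroid moved in this iteration.
        shift = max(math.dist(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift < tol:
            break
    return centroids

points = [(0, 0), (0, 1), (10, 10), (10, 11)]
model = kmeans(points, k=2)
```

Note that the only thing `kmeans` returns is the list of centroids. The training points are not kept.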

What defines your model is actually your centroids.

When you use APPLY_KMEANS, you run only an assignment step, using the data from your query and the centroids from your model. Each point is assigned to the cluster whose centroid is closest to it; the centroids themselves are not updated.
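As a rough illustration of that single assignment step, here is a sketch in plain Python. The function name `apply_kmeans`, the Euclidean distance, and the centroid values are hypothetical and chosen for the example; this is not Vertica's code. It also shows why this is not k-nearest neighbors: each new point is compared only to the k stored centroids, never to the original training points.

```python
import math

def apply_kmeans(new_points, centroids):
    """Return, for each new point, the index of the closest stored centroid."""
    return [
        min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
        for p in new_points
    ]

# Hypothetical centroids, standing in for what a trained model would store.
centroids = [(0.0, 0.5), (10.0, 10.5)]

# New data, standing in for the rows returned by your query.
new_points = [(1.0, 1.0), (9.0, 12.0), (4.0, 4.0)]

labels = apply_kmeans(new_points, centroids)
print(labels)  # prints [0, 1, 0]
```

The centroids are read but never modified, so applying the model to new data does not change the model.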

Hope it helps pltrdy

Note about clustering vs classification:
It is tempting to think of clustering as a kind of classification. However, classification should only refer to supervised learning (learning from labeled examples), while clustering is unsupervised learning (no labels). So don't mix up the two terms :)
