Reputation: 77
I have dataset on which I applied k-means and there are two clusters but distance from a particular point (x,y) to both cluster is same ,then in which cluster the point will go. please help me. Thanks in advance.
Upvotes: 1
Views: 2267
Reputation: 50678
In the case of ties, k-means clustering will randomly assign the ambiguous point to a cluster. (This is based on R's implementation of k-means clustering kmeans
.)
iris
data in RLet's start by loading necessary R libraries
library(broom)
library(tidyverse)
For this example, we will use the Petal.Length
and Petal.Width
measurements from the iris
dataset, and for simplicity exclude the "virginica" measurements so that the "setosa" and "versicolor" measurements form our two groups.
df <- iris %>%
filter(Species != "virginica") %>%
select(starts_with("Petal"), Species)
We now use k-means clustering with k = 2, and assign a cluster label to every (Petal.Length
, Petal.Width
) measurement; since the assignment of which group is "1" and which group is "2" is random, we use a fixed seed for reproducibility.
set.seed(2018)
kcl <- kmeans(df %>% select(-Species), 2)
df <- augment(kcl, df)
We show a scatterplot of Petal.Length
vs. Petal.Width
; the known Species
labels are shown by the different colours and the inferred cluster association by the different symbols.
ggplot(df, aes(Petal.Length, Petal.Width, colour = Species)) +
geom_point(aes(shape = .cluster), size = 3)
Let's manually calculate the within-cluster sum of squared pairwise distances; since we'll be needing this later as well, we'll create a function calculate_d
.
calculate_d <- function(df) {
df %>%
select(.cluster, Petal.Length, Petal.Width) %>%
group_by(.cluster) %>%
nest() %>%
mutate(dist = map_dbl(data, ~sum(as.matrix(dist(.x)^2)) / (2 * nrow(.x)))) %>%
pull(dist)
}
calculate_d(df)
#[1] 2.0220 12.7362
Notice how the distances are identical to the within-cluster sum of squares (WCSS)
kcl$withinss
#[1] 2.0220 12.7362
Now let's add a new measurement that has the same Euclidean distance from both cluster centers: to do so, we choose the point that lies exactly half-way between both cluster centers if you connect them by a straight line. All we need then is a bit of basic trigonometry to construct that point:
z <- kcl$centers[2, ] - kcl$center[1, ]
theta <- atan(z[2] / z[1])
dy <- sin(theta) * dist(kcl$centers) / 2
dx <- cos(theta) * dist(kcl$centers) / 2
x <- as.numeric(kcl$centers[1, 1] + dx)
y <- as.numeric(kcl$centers[1, 2] + dy)
We store our new point together with the 2 cluster centers in a new data.frame
. The first two rows give the position of cluster "1" and "2", and the third row contains our new point.
df2 <- bind_rows(as.data.frame(kcl$centers), c(Petal.Length = x, Petal.Width = y))
Let's show the new point together with the cluster centers on top of our (Petal.Length
, Petal.Width
) measurements.
df2 <- bind_rows(as.data.frame(kcl$centers), c(Petal.Length = x, Petal.Width = y))
ggplot(df, aes(Petal.Length, Petal.Width)) +
geom_point(aes(colour = Species, shape = .cluster), size = 3) +
geom_point(data = df2, aes(Petal.Length, Petal.Width), size = 4)
We confirm that the squared Euclidean distance between the new point and each cluster center is indeed the same; to do so we calculate the pairwise distances of our new point "3" to the cluster centres "1" and "2":
as.matrix(dist(df2))[, 3]
# 1 2 3
#1.4996 1.4996 0.0000
Now let's add our new point to the (Petal.Length
,Petal.Width
) measurements, and calculate the within cluster sum of squared pairwise distances, first with assigning our new point to cluster "1" and then with assigning our new point to cluster "2".
# Add new point and assign to cluster "1"
df.1 <- df %>%
bind_rows(cbind.data.frame(
Petal.Length = x,
Petal.Width = y,
Species = factor("setosa", levels = levels(df$Species)),
.cluster = factor(1, levels = 1:2)))
calculate_d(df.1)
#[1] 4.226707 12.736200
# Add new point and assign to cluster "2"
df.2 <- df %>%
bind_rows(cbind.data.frame(
Petal.Length = x,
Petal.Width = y,
Species = factor("versicolor", levels = levels(df$Species)),
.cluster = factor(2, levels = 1:2)))
calculate_d(df.2)
#[1] 2.02200 14.94091
Notice how the within-cluster squared pairwise distances are different even though the new point has exactly the same distances from either cluster centre. However notice also, how the sum of the within-cluster squared pairwise distances is the same!
sum(calculate_d(df.1))
#[1] 16.96291
sum(calculate_d(df.2))
#[1] 16.96291
identical(sum(calculate_d(df.2)), sum(calculate_d(df.1)))
# [1] TRUE
To show that kmeans
assigns the new point at random to either cluster we repeatedly cluster the data. To do so, we define a convenience function that returns the corresponding Species
of the new point following k-means clustering.
kmeans_cluster_data <- function(df) {
kcl <- kmeans(df %>% select(-Species), 2)
df <- augment(kcl, df)
map_cluster_to_Species <- df[1:(nrow(df) - 1), ] %>%
count(Species, .cluster) %>%
split(., .$.cluster)
map_cluster_to_Species[[
df[nrow(df), ] %>%
pull(.cluster) %>%
as.character()]]$Species %>% as.character()
}
We now repeatedly cluster the same data 100 times.
bind_cols(
Iteration = 1:100,
Species = map_chr(1:100, ~kmeans_cluster_data(df.1 %>% select(-.cluster)))) %>%
ggplot(aes(Iteration, Species, group = 1)) +
geom_line() +
labs(title = "Assignment of new point to group")
Notice how the new point gets assigned to either Species
group at random.
Upvotes: 4