Reputation: 77

break tie between k-means cluster

I have dataset on which I applied k-means and there are two clusters but distance from a particular point (x,y) to both cluster is same ,then in which cluster the point will go. please help me. Thanks in advance.

Upvotes: 1

Answers (1)

Maurits Evers

Reputation: 50678

tldr;

In the case of ties, k-means clustering will randomly assign the ambiguous point to a cluster. (This is based on R's implementation of k-means clustering kmeans.)

A specific example based on the `iris` data in R

Let's start by loading necessary R libraries
```
library(broom)
library(tidyverse)
```
For this example, we will use the Petal.Length and Petal.Width measurements from the iris dataset, and for simplicity exclude the "virginica" measurements so that the "setosa" and "versicolor" measurements form our two groups.
```
df <- iris %>%
    filter(Species != "virginica") %>%
    select(starts_with("Petal"), Species)
```
We now use k-means clustering with k = 2, and assign a cluster label to every (Petal.Length, Petal.Width) measurement; since the assignment of which group is "1" and which group is "2" is random, we use a fixed seed for reproducibility.
```
set.seed(2018)
kcl <- kmeans(df %>% select(-Species), 2)
df <- augment(kcl, df)
```
We show a scatterplot of Petal.Length vs. Petal.Width; the known Species labels are shown by the different colours and the inferred cluster association by the different symbols.
```
ggplot(df, aes(Petal.Length, Petal.Width, colour = Species)) +
    geom_point(aes(shape = .cluster), size = 3)
```

Let's manually calculate the within-cluster sum of squared pairwise distances; since we'll be needing this later as well, we'll create a function calculate_d.

calculate_d <- function(df) {
    df %>%
        select(.cluster, Petal.Length, Petal.Width) %>%
        group_by(.cluster) %>%
        nest() %>%
        mutate(dist = map_dbl(data, ~sum(as.matrix(dist(.x)^2)) / (2 * nrow(.x)))) %>%
        pull(dist)
}

calculate_d(df)
#[1]  2.0220 12.7362

Notice how the distances are identical to the within-cluster sum of squares (WCSS)

kcl$withinss
#[1]  2.0220 12.7362

Now let's add a new measurement that has the same Euclidean distance from both cluster centers: to do so, we choose the point that lies exactly half-way between both cluster centers if you connect them by a straight line. All we need then is a bit of basic trigonometry to construct that point:
```
z <- kcl$centers[2, ] - kcl$center[1, ]
theta <- atan(z[2] / z[1])

dy <- sin(theta) * dist(kcl$centers) / 2
dx <- cos(theta) * dist(kcl$centers) / 2

x <- as.numeric(kcl$centers[1, 1] + dx)
y <- as.numeric(kcl$centers[1, 2] + dy)
```
We store our new point together with the 2 cluster centers in a new data.frame. The first two rows give the position of cluster "1" and "2", and the third row contains our new point.
```
df2 <- bind_rows(as.data.frame(kcl$centers), c(Petal.Length = x, Petal.Width = y))
```
Let's show the new point together with the cluster centers on top of our (Petal.Length, Petal.Width) measurements.
```
df2 <- bind_rows(as.data.frame(kcl$centers), c(Petal.Length = x, Petal.Width = y))
ggplot(df, aes(Petal.Length, Petal.Width)) +
    geom_point(aes(colour = Species, shape = .cluster), size = 3) +
    geom_point(data = df2, aes(Petal.Length, Petal.Width), size = 4)
```
We confirm that the squared Euclidean distance between the new point and each cluster center is indeed the same; to do so we calculate the pairwise distances of our new point "3" to the cluster centres "1" and "2":
```
as.matrix(dist(df2))[, 3]
#     1      2      3
#1.4996 1.4996 0.0000
```

Now let's add our new point to the (Petal.Length,Petal.Width) measurements, and calculate the within cluster sum of squared pairwise distances, first with assigning our new point to cluster "1" and then with assigning our new point to cluster "2".

# Add new point and assign to cluster "1"
df.1 <- df %>%
    bind_rows(cbind.data.frame(
        Petal.Length = x,
        Petal.Width = y,
        Species = factor("setosa", levels = levels(df$Species)),
        .cluster = factor(1, levels = 1:2)))
calculate_d(df.1)
#[1]  4.226707 12.736200

# Add new point and assign to cluster "2"
df.2 <- df %>%
    bind_rows(cbind.data.frame(
        Petal.Length = x,
        Petal.Width = y,
        Species = factor("versicolor", levels = levels(df$Species)),
        .cluster = factor(2, levels = 1:2)))
calculate_d(df.2)
#[1]  2.02200 14.94091

Notice how the within-cluster squared pairwise distances are different even though the new point has exactly the same distances from either cluster centre. However notice also, how the sum of the within-cluster squared pairwise distances is the same!

sum(calculate_d(df.1))
#[1] 16.96291

sum(calculate_d(df.2))
#[1] 16.96291

identical(sum(calculate_d(df.2)), sum(calculate_d(df.1)))
# [1] TRUE

To show that kmeans assigns the new point at random to either cluster we repeatedly cluster the data. To do so, we define a convenience function that returns the corresponding Species of the new point following k-means clustering.

kmeans_cluster_data <- function(df) {
    kcl <- kmeans(df %>% select(-Species), 2)
    df <- augment(kcl, df)
    map_cluster_to_Species <- df[1:(nrow(df) - 1), ] %>%
        count(Species, .cluster) %>%
        split(., .$.cluster)
    map_cluster_to_Species[[
        df[nrow(df), ] %>%
            pull(.cluster) %>%
            as.character()]]$Species %>% as.character()
}

We now repeatedly cluster the same data 100 times.

bind_cols(
    Iteration = 1:100,
    Species = map_chr(1:100, ~kmeans_cluster_data(df.1 %>% select(-.cluster)))) %>%
ggplot(aes(Iteration, Species, group = 1)) +
    geom_line() +
    labs(title = "Assignment of new point to group")

Notice how the new point gets assigned to either Species group at random.

Upvotes: 4

break tie between k-means cluster

Answers (1)

tldr;

A specific example based on the iris data in R

Related Questions

A specific example based on the `iris` data in R