mks
mks

Reputation: 77

break tie between k-means cluster

I have dataset on which I applied k-means and there are two clusters but distance from a particular point (x,y) to both cluster is same ,then in which cluster the point will go. please help me. Thanks in advance.

Upvotes: 1

Views: 2267

Answers (1)

Maurits Evers
Maurits Evers

Reputation: 50678

tldr;

In the case of ties, k-means clustering will randomly assign the ambiguous point to a cluster. (This is based on R's implementation of k-means clustering kmeans.)


A specific example based on the iris data in R

  1. Let's start by loading necessary R libraries

    library(broom)
    library(tidyverse)
    
  2. For this example, we will use the Petal.Length and Petal.Width measurements from the iris dataset, and for simplicity exclude the "virginica" measurements so that the "setosa" and "versicolor" measurements form our two groups.

    df <- iris %>%
        filter(Species != "virginica") %>%
        select(starts_with("Petal"), Species)
    
  3. We now use k-means clustering with k = 2, and assign a cluster label to every (Petal.Length, Petal.Width) measurement; since the assignment of which group is "1" and which group is "2" is random, we use a fixed seed for reproducibility.

    set.seed(2018)
    kcl <- kmeans(df %>% select(-Species), 2)
    df <- augment(kcl, df)
    
  4. We show a scatterplot of Petal.Length vs. Petal.Width; the known Species labels are shown by the different colours and the inferred cluster association by the different symbols.

    ggplot(df, aes(Petal.Length, Petal.Width, colour = Species)) +
        geom_point(aes(shape = .cluster), size = 3)
    

    enter image description here

  5. Let's manually calculate the within-cluster sum of squared pairwise distances; since we'll be needing this later as well, we'll create a function calculate_d.

    calculate_d <- function(df) {
        df %>%
            select(.cluster, Petal.Length, Petal.Width) %>%
            group_by(.cluster) %>%
            nest() %>%
            mutate(dist = map_dbl(data, ~sum(as.matrix(dist(.x)^2)) / (2 * nrow(.x)))) %>%
            pull(dist)
    }
    
    calculate_d(df)
    #[1]  2.0220 12.7362
    

    Notice how the distances are identical to the within-cluster sum of squares (WCSS)

    kcl$withinss
    #[1]  2.0220 12.7362
    
  6. Now let's add a new measurement that has the same Euclidean distance from both cluster centers: to do so, we choose the point that lies exactly half-way between both cluster centers if you connect them by a straight line. All we need then is a bit of basic trigonometry to construct that point:

    z <- kcl$centers[2, ] - kcl$center[1, ]
    theta <- atan(z[2] / z[1])
    
    dy <- sin(theta) * dist(kcl$centers) / 2
    dx <- cos(theta) * dist(kcl$centers) / 2
    
    x <- as.numeric(kcl$centers[1, 1] + dx)
    y <- as.numeric(kcl$centers[1, 2] + dy)
    

    We store our new point together with the 2 cluster centers in a new data.frame. The first two rows give the position of cluster "1" and "2", and the third row contains our new point.

    df2 <- bind_rows(as.data.frame(kcl$centers), c(Petal.Length = x, Petal.Width = y))
    

    Let's show the new point together with the cluster centers on top of our (Petal.Length, Petal.Width) measurements.

    df2 <- bind_rows(as.data.frame(kcl$centers), c(Petal.Length = x, Petal.Width = y))
    ggplot(df, aes(Petal.Length, Petal.Width)) +
        geom_point(aes(colour = Species, shape = .cluster), size = 3) +
        geom_point(data = df2, aes(Petal.Length, Petal.Width), size = 4)
    

    enter image description here

    We confirm that the squared Euclidean distance between the new point and each cluster center is indeed the same; to do so we calculate the pairwise distances of our new point "3" to the cluster centres "1" and "2":

    as.matrix(dist(df2))[, 3]
    #     1      2      3
    #1.4996 1.4996 0.0000
    
  7. Now let's add our new point to the (Petal.Length,Petal.Width) measurements, and calculate the within cluster sum of squared pairwise distances, first with assigning our new point to cluster "1" and then with assigning our new point to cluster "2".

    # Add new point and assign to cluster "1"
    df.1 <- df %>%
        bind_rows(cbind.data.frame(
            Petal.Length = x,
            Petal.Width = y,
            Species = factor("setosa", levels = levels(df$Species)),
            .cluster = factor(1, levels = 1:2)))
    calculate_d(df.1)
    #[1]  4.226707 12.736200
    
    # Add new point and assign to cluster "2"
    df.2 <- df %>%
        bind_rows(cbind.data.frame(
            Petal.Length = x,
            Petal.Width = y,
            Species = factor("versicolor", levels = levels(df$Species)),
            .cluster = factor(2, levels = 1:2)))
    calculate_d(df.2)
    #[1]  2.02200 14.94091
    

    Notice how the within-cluster squared pairwise distances are different even though the new point has exactly the same distances from either cluster centre. However notice also, how the sum of the within-cluster squared pairwise distances is the same!

    sum(calculate_d(df.1))
    #[1] 16.96291
    
    sum(calculate_d(df.2))
    #[1] 16.96291
    
    identical(sum(calculate_d(df.2)), sum(calculate_d(df.1)))
    # [1] TRUE
    
  8. To show that kmeans assigns the new point at random to either cluster we repeatedly cluster the data. To do so, we define a convenience function that returns the corresponding Species of the new point following k-means clustering.

    kmeans_cluster_data <- function(df) {
        kcl <- kmeans(df %>% select(-Species), 2)
        df <- augment(kcl, df)
        map_cluster_to_Species <- df[1:(nrow(df) - 1), ] %>%
            count(Species, .cluster) %>%
            split(., .$.cluster)
        map_cluster_to_Species[[
            df[nrow(df), ] %>%
                pull(.cluster) %>%
                as.character()]]$Species %>% as.character()
    }
    

    We now repeatedly cluster the same data 100 times.

    bind_cols(
        Iteration = 1:100,
        Species = map_chr(1:100, ~kmeans_cluster_data(df.1 %>% select(-.cluster)))) %>%
    ggplot(aes(Iteration, Species, group = 1)) +
        geom_line() +
        labs(title = "Assignment of new point to group")
    

    enter image description here

    Notice how the new point gets assigned to either Species group at random.

Upvotes: 4

Related Questions