Mal_a

Reputation: 3760

R Unsupervised Clustering by group (?)

My main goal is to find the groups in which many consecutive points lie on (almost) the same line. My idea was to do this with k-means, but maybe you have a better idea.

I will explain it using the two plots below (each plot shows one group):

Plot 1 for Group 1:

We can see that many points lie at almost the same y value, and I am trying to figure out how to find the groups that have such a "point distribution".

Below is plot 2, for Group 2, which does not show such a "point distribution":


Here is the data that corresponds to both plots above:

structure(list(Group = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 1), 
    x = c(100L, 150L, 250L, 287L, 312L, 387L, 475L, 550L, 837L, 
    937L, 987L, 1087L, 1175L, 1300L, 1325L, 1487L, 1662L, 1700L, 
    1725L, 1812L, 1912L, 2412L, 3012L, 3562L, 4162L, 4762L, 5362L, 
    5750L, 5712L, 6225L, 6825L, 6887L, 7237L, 7850L, 7800L, 7937L, 
    7975L, 8275L, 8362L, 8662L, 8725L, 8950L, 9100L, 9312L, 9400L, 
    9600L, 550L, 612L, 1962L, 5412L, 8425L, 9375L, 5412L), y = c(493L, 
    482L, 479L, 476L, 481L, 479L, 474L, 480L, 480L, 491L, 489L, 
    490L, 485L, 485L, 485L, 479L, 482L, 482L, 482L, 482L, 484L, 
    489L, 491L, 489L, 496L, 498L, 500L, 0L, 498L, 500L, 502L, 
    506L, 497L, 0L, 495L, 506L, 497L, 494L, 498L, 500L, 496L, 
    499L, 496L, 495L, 495L, 498L, 442L, 447L, 394L, 465L, 806L, 
    700L, 502L)), row.names = c(23L, 24L, 25L, 26L, 27L, 28L, 
29L, 30L, 31L, 32L, 33L, 34L, 35L, 36L, 37L, 38L, 39L, 40L, 41L, 
42L, 43L, 44L, 45L, 46L, 47L, 48L, 49L, 51L, 52L, 53L, 54L, 55L, 
56L, 57L, 58L, 59L, 60L, 61L, 62L, 63L, 64L, 65L, 66L, 67L, 68L, 
69L, 574L, 575L, 576L, 577L, 578L, 579L, 815L), class = "data.frame")

Short Explanation:

Group   x   y
1 100 493
1 150 482
1 250 479
1 287 476
1 312 481
1 387 479

Each row holds a Group (1 or 2) and the x and y coordinates of one point.

My approach so far:

I rounded the y values to the nearest multiple of 20 using this code:

    # round x to the nearest multiple of `accuracy`
    round_any <- function(x, accuracy, f = round) { f(x / accuracy) * accuracy }
    data$y_rd <- round_any(data$y, 20)

I did that because the points usually do not lie on exactly the same y line.

Furthermore, I used this code to create clusters per Group, based on the x coordinate, for each y_rd (rounded y coordinate):

    data$id <- paste(data$Group, data$y_rd, sep = "_") # create id that contains Group and y_rd values
    res2 <- tapply(data$x, INDEX = data$id, function(x) kmeans(x,2)) # kmeans with fixed number of clusters    
    res3 <- lapply(names(res2), function(x) data.frame(y=x, Centers=res2[[x]]$centers, Size=res2[[x]]$size))     
    res3 <- do.call(rbind, res3)

However, it does not work the way I need, because I cannot fix the number of clusters in advance for each Group and y_rd.

At this point I am stuck and do not know which approach to take to find the groups that have such a distribution.

The result I would like to get:

Group Cluster MaxPoints
1      1         3
1      2         20
1      3         7

I am open to any ideas or tips that would help me find the groups that show such a pattern. Thanks!

Upvotes: 0

Views: 313

Answers (1)

s__

Reputation: 9485

Some points of your question are not clear to me, so here is an answer that may serve as a starting point.

Since the most important variable seems to be y, you can study it within each group and then apply k-means to the "winner" groups.

First, you can try to detect the groups that plausibly have a "line" distribution by looking at boxplots or histograms:

library(ggplot2)  # dats is the question's data frame, including the rounded y_rd column
ggplot(dats, aes(y_rd)) + geom_histogram() + facet_wrap(vars(Group)) + theme_light()


Now it seems there is one group with a long line and a smaller cluster (Group 1) and one group with many small clusters (Group 2). So in this case you can divide your data into groups that have a long line plus a small cluster, like Group 1, and groups with many small clusters and no long line, like Group 2.

The idea is to sort your 100 groups into "no long line", "long line and 1 small cluster", "long line and 2 small clusters" and so on. Having those, you can split the dataset and perform the clustering. Here we discard the second group and use k-means with 2 centers for the first, since it seems to have a long line and one other small cluster.
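
If eyeballing histograms for many groups is impractical, the same judgement could be approximated numerically. The following is only a sketch on made-up toy values: it computes, per group, the share of points that fall into the most populated y_rd band. The names `line_share` and `long_line_groups` and the 0.5 cut-off are assumptions for illustration, not something taken from the question.

```r
# Toy data (hypothetical values) with the same column names as in the question
dats <- data.frame(
  Group = c(1, 1, 1, 1, 1, 2, 2, 2, 2),
  y     = c(480, 482, 479, 0, 495, 442, 394, 806, 700)
)
dats$y_rd <- round(dats$y / 20) * 20  # same rounding as round_any(y, 20)

# For each group: share of its points that sit in its most populated y band
line_share <- sapply(
  split(dats$y_rd, dats$Group),
  function(v) max(table(v)) / length(v)
)

# Assumed cut-off: call it a "long line" when more than half the points share one band
long_line_groups <- names(line_share)[line_share > 0.5]

print(line_share)
print(long_line_groups)
```

The resulting group labels could then be used to fill `vec` instead of choosing the groups by hand; the cut-off would need tuning on the real data.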

vec <- c(1)  # groups that seem to have a long line

# Loop over those groups. The number of clusters is fixed to two here; after looking
# at the histograms you can run one such loop per set of similarly distributed groups.
listed <- list()
for (i in vec) {
  grp <- dats[dats$Group == i, c("x", "y", "y_rd")]  # keep x and y for plotting
  clustering <- kmeans(grp$y_rd, 2)                  # cluster on the rounded y only
  listed[[i]] <- data.frame(grp, cl = clustering$cluster)
}

Now you can plot it:

library(ggplot2)
ggplot(listed[[1]], aes(x,y, color = as.factor(cl))) + geom_point() + theme_light()

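
If fixing the number of clusters is the main obstacle, a different technique that needs no k at all may be worth trying: order each group by x and count runs of consecutive points that share the same y_rd band, using base R's `rle()`. This is run-length encoding, not k-means, and it is only a sketch on made-up toy values; the helper name `runs_per_group` and the output column `Points` are assumptions chosen to resemble the Group / Cluster / MaxPoints table requested in the question.

```r
# Toy data (hypothetical values) with the question's column names
dats <- data.frame(
  Group = c(1, 1, 1, 1, 1, 1, 2, 2, 2),
  x     = c(100, 150, 250, 300, 400, 500, 550, 612, 1962),
  y     = c(481, 482, 479, 0, 498, 500, 442, 394, 806)
)
dats$y_rd <- round(dats$y / 20) * 20  # same rounding as round_any(y, 20)

# Hypothetical helper: one row per run of consecutive points in the same y band
runs_per_group <- function(d) {
  d <- d[order(d$x), ]   # "after each other" is taken to mean ordered by x
  r <- rle(d$y_rd)       # run-length encoding of the band sequence
  data.frame(
    Group   = d$Group[1],
    Cluster = seq_along(r$lengths),
    y_rd    = r$values,
    Points  = r$lengths
  )
}

res <- do.call(rbind, lapply(split(dats, dats$Group), runs_per_group))
rownames(res) <- NULL

# Longest run per group: large values suggest a "long line"
longest <- aggregate(Points ~ Group, data = res, FUN = max)

print(res)
print(longest)
```

Note that a single outlier (such as the y = 0 points in the question's data) splits a line into two runs, so some tolerance for isolated outliers may be needed on real data.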

Upvotes: 1
