Reputation: 3760
My main goal is to find the groups in which many points lie one after another on the same line. My idea was to do this with k-means, but maybe you have a better idea.
I will explain it using the two plots below (each plot shows one group):
In plot 1 (Group 1) we can see that many points lie at almost the same y value, and I am trying to figure out how to find the groups that have such a "points distribution".
Plot 2 (Group 2) does not show such a "points distribution".
Here we can find the data that corresponds to both plots above:
structure(list(Group = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 1),
x = c(100L, 150L, 250L, 287L, 312L, 387L, 475L, 550L, 837L,
937L, 987L, 1087L, 1175L, 1300L, 1325L, 1487L, 1662L, 1700L,
1725L, 1812L, 1912L, 2412L, 3012L, 3562L, 4162L, 4762L, 5362L,
5750L, 5712L, 6225L, 6825L, 6887L, 7237L, 7850L, 7800L, 7937L,
7975L, 8275L, 8362L, 8662L, 8725L, 8950L, 9100L, 9312L, 9400L,
9600L, 550L, 612L, 1962L, 5412L, 8425L, 9375L, 5412L), y = c(493L,
482L, 479L, 476L, 481L, 479L, 474L, 480L, 480L, 491L, 489L,
490L, 485L, 485L, 485L, 479L, 482L, 482L, 482L, 482L, 484L,
489L, 491L, 489L, 496L, 498L, 500L, 0L, 498L, 500L, 502L,
506L, 497L, 0L, 495L, 506L, 497L, 494L, 498L, 500L, 496L,
499L, 496L, 495L, 495L, 498L, 442L, 447L, 394L, 465L, 806L,
700L, 502L)), row.names = c(23L, 24L, 25L, 26L, 27L, 28L,
29L, 30L, 31L, 32L, 33L, 34L, 35L, 36L, 37L, 38L, 39L, 40L, 41L,
42L, 43L, 44L, 45L, 46L, 47L, 48L, 49L, 51L, 52L, 53L, 54L, 55L,
56L, 57L, 58L, 59L, 60L, 61L, 62L, 63L, 64L, 65L, 66L, 67L, 68L,
69L, 574L, 575L, 576L, 577L, 578L, 579L, 815L), class = "data.frame")
Short Explanation:
Group x y
1 100 493
1 150 482
1 250 479
1 287 476
1 312 481
1 387 479
Each row has a Group (1 or 2) and the x and y coordinates of one point.
My approach so far:
I have rounded the y values to the nearest multiple of 20 using this code:
round_any = function(x, accuracy, f=round){f(x/ accuracy) * accuracy} # function to round the y
data$y_rd <- round_any(data$y, 20)
I did that because the points usually do not lie on exactly the same y line.
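To illustrate what the rounding does, here is a small sketch that applies `round_any` to a few y values taken from the data above (the vector `y_sample` is just a name chosen for this example). It also shows a limitation of fixed bins: 489 and 491 are only 2 apart but end up in different bins.

```r
# Same helper as above: round x to the nearest multiple of `accuracy`
round_any <- function(x, accuracy, f = round) f(x / accuracy) * accuracy

# A few y values from the data (y_sample is a name made up for this example)
y_sample <- c(493, 482, 476, 489, 491, 0)
rounded <- round_any(y_sample, 20)

# Show original and rounded values side by side
print(data.frame(y = y_sample, y_rd = rounded))
```

So values near a bin edge (here 490) can be split across two y_rd values even though they belong to the same visual line.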
Furthermore, I have used this code to create clusters per Group, based on the x coordinate, for each y_rd (rounded y coordinate):
data$id <- paste(data$Group, data$y_rd, sep = "_") # create id that contains Group and y_rd values
res2 <- tapply(data$x, INDEX = data$id, function(x) kmeans(x,2)) # kmeans with fixed number of clusters
res3 <- lapply(names(res2), function(x) data.frame(y=x, Centers=res2[[x]]$centers, Size=res2[[x]]$size))
res3 <- do.call(rbind, res3)
However, it is not working the way I need, because I cannot define a fixed number of clusters for each Group and y_rd.
At this point I am stuck and do not know what approach I can take to find the Groups that have such a distribution.
Result that I would like to get:
Group Cluster MaxPoints
1 1 3
1 2 20
1 3 7
I am open to any ideas or tips that would help me find the Groups displaying such a pattern. Thanks!
Upvotes: 0
Views: 313
Reputation: 9485
Some points of your question are not clear to me, so here is an answer that could be a starting point.
Since y seems to be the most important variable, you can study it within each group, then apply k-means to the "winner" groups.
First you can try to detect the Groups that plausibly have a "line" distribution by looking at some boxplots or histograms:
dats %>% ggplot(aes(y_rd)) + geom_histogram() + facet_wrap(vars(Group)) + theme_light()
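With many groups, reading every histogram by eye gets tedious, so a numeric summary can help with the first screening. The following is only a sketch under my own assumptions: it uses a small subset of the question's data (the first six rows of Group 1 and the six consecutive Group 2 rows), and the name `top_share` and the idea of using "share of points in the most frequent y_rd bin" as a score are mine, not something from the question.

```r
# Rounding helper from the question
round_any <- function(x, accuracy, f = round) f(x / accuracy) * accuracy

# Subset of the question's data: first six rows of Group 1, six rows of Group 2
dats <- data.frame(
  Group = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2),
  x = c(100, 150, 250, 287, 312, 387, 550, 612, 1962, 5412, 8425, 9375),
  y = c(493, 482, 479, 476, 481, 479, 442, 447, 394, 465, 806, 700)
)
dats$y_rd <- round_any(dats$y, 20)

# For each group: fraction of its points that fall in its most frequent y_rd bin
# (top_share is a hypothetical score, chosen only for this illustration)
top_share <- tapply(dats$y_rd, dats$Group,
                    function(v) max(table(v)) / length(v))
print(round(top_share, 2))
```

A group whose share is close to 1 has most of its points on one rounded y level, which is roughly what a tall single bar in the histogram shows. Where to put the cutoff between "line" and "no line" is a judgment call that depends on your data.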
Now it seems there is one group with a long line and a smaller cluster (Group 1) and one group with many small clusters (Group 2). So you can sort your groups by shape: the idea is to divide your 100 groups into "no long line", "long line and 1 small cluster", "long line and 2 small clusters" and so on. Once you have those categories, you can split the dataset and perform the clustering on each category. In this case, we discard the second group and use k-means with 2 centers for the first, since it seems to have a long line and one other small cluster.
vec <- c(1) # vector of groups that seem to have a long line
# a loop to cluster them: this is fixed to two clusters; looking at the
# histograms you can write one loop per set of groups with similar distributions
listed <- list()
for (i in vec){
  grp <- dats[dats$Group == i, ]     # rows of the current group (was hard-coded to 1)
  clustering <- kmeans(grp$y_rd, 2)  # cluster on the rounded y (column 4)
  listed[[i]] <- data.frame(grp, cl = clustering$cluster) # keep x and y for plotting
}
Now you can plot it:
library(ggplot2)
ggplot(listed[[1]], aes(x,y, color = as.factor(cl))) + geom_point() + theme_light()
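If what you need in the end is a table like the Group / Cluster / MaxPoints one in the question, k-means may not be necessary. As an alternative sketch (a different technique from the clustering above, and only my reading of what "points after each other on the same line" means), you can sort each group by x and count runs of consecutive points that share the same y_rd with `rle`. The function name `runs_per_group` and the column names `Cluster` and `Points` are made up for this example, and the data is again a small subset of the question's data.

```r
# Rounding helper from the question
round_any <- function(x, accuracy, f = round) f(x / accuracy) * accuracy

# Hypothetical helper: one row per run of consecutive points with equal y_rd
runs_per_group <- function(df, accuracy = 20) {
  df <- df[order(df$Group, df$x), ]         # walk each group from left to right
  df$y_rd <- round_any(df$y, accuracy)
  out <- lapply(split(df, df$Group), function(g) {
    r <- rle(g$y_rd)                        # run lengths of identical y_rd values
    data.frame(Group = g$Group[1],
               Cluster = seq_along(r$lengths),
               y_rd = r$values,
               Points = r$lengths)
  })
  res <- do.call(rbind, out)
  rownames(res) <- NULL
  res
}

# Subset of the question's data: first six rows of Group 1, six rows of Group 2
dats <- data.frame(
  Group = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2),
  x = c(100, 150, 250, 287, 312, 387, 550, 612, 1962, 5412, 8425, 9375),
  y = c(493, 482, 479, 476, 481, 479, 442, 447, 394, 465, 806, 700)
)

res <- runs_per_group(dats)
print(res)

# Longest run per group: a large value suggests a "line"
longest <- tapply(res$Points, res$Group, max)
print(longest)
```

Note that a single outlier (such as the y = 0 points in Group 1 of the full data) breaks a run in two, so you may want to filter outliers first or merge runs that share the same y_rd.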
Upvotes: 1