user9798936


How to extract the observation values for each k-means cluster

I have data drawn from a mixture of two distributions. I fit k-means to the data with $2$ centers and get the cluster assignments. Instead of the cluster number of each observation, I would like to split the data itself into two groups: the first group should contain the observations assigned to the first cluster, and the second group those assigned to the second cluster. My data is a two-column matrix.

Here is my attempt:

kme <- kmeans(Sim, 2)
kme$cluster 

which gives this:

kme$cluster
  [1] 1 2 2 1 1 1 2 2 2 1 2 2 1 2 1 2 1 2 2 1 2 1 2 2 2 1 2 1 2 1 1 2 1 1 1 2 2 1 1 1 1 1 2 2 1 1 1 2 2 1 2 1 2 2 2
 [56] 1 2 1 2 2 1 2 1 1 2 2 1 2 2 1 1 2 2 1 2 2 1 1 1 2 2 2 2 2 2 1 2 2 2 1 2 2 2 2 1 1 2 2 1 2

I know this means that the first row of my matrix belongs to the first cluster and the second and third rows belong to the second cluster. Instead, I want two groups: one holding the observations (the values, not the cluster numbers) of the first cluster, and the other holding the observations of the second cluster.

For example, with the cluster number shown in the third column:

          [,1]      [,2]  [,3]
[1,] 0.8026952 0.8049413    1
[2,] 0.4333745 0.5063472    2 
[3,] 0.3587946 0.4091627    2
[4,] 0.9067146 0.9211618    1 
[5,] 0.6663730 0.6644439    1 
[6,] 0.9752217 0.8299001    1 

Hence, I want it like this:

Group_1

          [,1]      [,2]
[1,] 0.8026952 0.8049413
[2,] 0.9067146 0.9211618
[3,] 0.6663730 0.6644439
[4,] 0.9752217 0.8299001

Group_2

          [,1]      [,2]
[1,] 0.4333745 0.5063472
[2,] 0.3587946 0.4091627

## my data
Sim <- structure(c(0.8026952064848, 0.433374540465373, 0.35879457564118, 
0.906714606331661, 0.666372966486961, 0.975221659988165, 0.146514602801487, 
0.185211665343342, 0.266845172200967, 0.9316249943804, 0.458760005421937, 
0.260092565789819, 0.546946153900359, 0.320214906940237, 0.998543527442962, 
0.264783770404576, 0.940526409307495, 0.218771387590095, 0.00109510733232848, 
0.909367726704406, 0.195467973826453, 0.853418850837688, 0.257240866776556, 
0.18492349224921, 0.0350681275368262, 0.743108308431699, 0.120800079312176, 
0.536067422405767, 0.387076289858669, 0.859893148997799, 0.962759922724217, 
0.0288314732712864, 0.878663770621642, 0.98208610656754, 0.98423704248853, 
0.0850008164197942, 0.415692074922845, 0.725441533140838, 0.514739896170795, 
0.564903213409707, 0.65493689605431, 0.551635805051774, 0.20452569425106, 
0.0509099354967475, 0.646801606381046, 0.656341063790023, 0.706781879998744, 
0.244539211907925, 0.43318469475677, 0.848426640266553, 0.26359805940462, 
0.730860544172275, 0.405211122473702, 0.401496034115553, 0.432796132021846, 
0.654138915939257, 0.00803712895140052, 0.991968845921972, 0.0311756118742527, 
0.0648601313587278, 0.733741108178729, 0.0431173096876591, 0.619796682847664, 
0.804308546474203, 0.0934691624715924, 0.520366458455101, 0.833598382357762, 
0.373484763782471, 0.261487311183624, 0.822368689114228, 0.88254910800606, 
0.261728620579622, 0.109025254459585, 0.661885950024542, 0.231851563323289, 
0.46855820226483, 0.909970719134435, 0.799321972066537, 0.646252158097923, 
0.233985049184412, 0.309839888018159, 0.129971102112904, 0.0901338488329202, 
0.460395671925082, 0.274646409088746, 0.675003502921675, 0.00289221783168614, 
0.336108531044562, 0.371105678845197, 0.607435576152056, 0.156731446506456, 
0.246894558891654, 0.418194083335386, 0.000669385509081014, 0.929943428778418, 
0.972200238145888, 0.503282874496368, 0.126382717164233, 0.683936105109751, 
0.21720214970307, 0.804941252722838, 0.506347232734472, 0.409162739287115, 
0.921161751145135, 0.664443932378791, 0.829900114789874, 0.0660539097664178, 
0.296326436845226, 0.120007439729838, 0.768823563807157, 0.449026418114183, 
0.268668511775742, 0.733763495587273, 0.365402223476625, 0.97980160509396, 
0.335119241818387, 0.929315469866307, 0.253016166717649, 0.00521095494948787, 
0.870041067705, 0.215020805969677, 0.858896143709886, 0.167998804405928, 
0.204213777320881, 0.050652931423494, 0.731499125526297, 0.166061290725948, 
0.520575411719918, 0.370579454420263, 0.655607928337889, 0.978414469097905, 
0.00268175014874324, 0.937587480238656, 0.992468047261219, 0.856301580636229, 
0.106064732119751, 0.530228247677302, 0.502227925225818, 0.66462369930413, 
0.526988978414104, 0.394591213637187, 0.623968017885322, 0.222666427921132, 
0.0707407196787662, 0.715361864683925, 0.561951996212598, 0.874765155771585, 
0.217631973951671, 0.576708062239157, 0.910641489550344, 0.215463715360162, 
0.761807500922947, 0.417110771840405, 0.497162608159201, 0.530665309105489, 
0.689703677933362, 0.00811876221245061, 0.991245541114815, 0.0518070069187705, 
0.0733367055960226, 0.803126294581356, 0.0291602667026993, 0.724848517465592, 
0.682316094846719, 0.0914714514707226, 0.426956537783392, 0.826985575416605, 
0.3128962286514, 0.295208624024388, 0.58934716401092, 0.856718183582533, 
0.183019143019377, 0.302561606994597, 0.666755501118539, 0.176298329811281, 
0.389183841328174, 0.86253900906311, 0.753736534075238, 0.627220192419063, 
0.319958512526359, 0.321602248149364, 0.161772830672492, 0.103166641060684, 
0.339980194505715, 0.218533019046996, 0.689884789678819, 0.00251942038852481, 
0.174792447835404, 0.509071373135409, 0.647835095901117, 0.22572898134156, 
0.287369659385574, 0.538675651472693, 0.000995476493411555, 0.939528694637273, 
0.961510166904661, 0.452822116916426, 0.2061782381611, 0.722694525115558, 
0.328404467661884), .Dim = c(100L, 2L))

Upvotes: 1

Views: 1113

Answers (2)

Anoushiravan R

Reputation: 21908

I hope this is what you are looking for.

  • I converted the matrix to a data frame so that split() keeps the rows intact. A matrix is just a vector with a dim attribute, so split() would otherwise treat it as a vector and divide it element by element.
  • split() divides a data frame or a vector into the groups defined by f, which in your case are the unique cluster values.
kme <- kmeans(Sim, 2)
kme$cluster 

Sim2 <- as.data.frame(cbind(Sim, kme$cluster))
split(Sim2, Sim2$V3) |>
  setNames(paste("Group", sort(unique(kme$cluster))))

$`Group 1`
              V1           V2 V3
2   0.4333745405 0.5063472327  1
3   0.3587945756 0.4091627393  1
7   0.1465146028 0.0660539098  1
8   0.1852116653 0.2963264368  1
9   0.2668451722 0.1200074397  1
11  0.4587600054 0.4490264181  1
12  0.2600925658 0.2686685118  1
14  0.3202149069 0.3654022235  1
16  0.2647837704 0.3351192418  1
18  0.2187713876 0.2530161667  1
19  0.0010951073 0.0052109549  1
21  0.1954679738 0.2150208060  1
23  0.2572408668 0.1679988044  1
24  0.1849234922 0.2042137773  1
25  0.0350681275 0.0506529314  1
27  0.1208000793 0.1660612907  1
29  0.3870762899 0.3705794544  1
32  0.0288314733 0.0026817501  1
36  0.0850008164 0.1060647321  1
37  0.4156920749 0.5302282477  1
43  0.2045256943 0.2226664279  1
44  0.0509099355 0.0707407197  1
48  0.2445392119 0.2176319740  1
49  0.4331846948 0.5767080622  1
51  0.2635980594 0.2154637154  1
53  0.4052111225 0.4171107718  1
54  0.4014960341 0.4971626082  1
55  0.4327961320 0.5306653091  1
57  0.0080371290 0.0081187622  1
59  0.0311756119 0.0518070069  1
60  0.0648601314 0.0733367056  1
62  0.0431173097 0.0291602667  1
65  0.0934691625 0.0914714515  1
66  0.5203664585 0.4269565378  1
68  0.3734847638 0.3128962287  1
69  0.2614873112 0.2952086240  1
72  0.2617286206 0.1830191430  1
73  0.1090252545 0.3025616070  1
75  0.2318515633 0.1762983298  1
76  0.4685582023 0.3891838413  1
80  0.2339850492 0.3199585125  1
81  0.3098398880 0.3216022481  1
82  0.1299711021 0.1617728307  1
83  0.0901338488 0.1031666411  1
84  0.4603956719 0.3399801945  1
85  0.2746464091 0.2185330190  1
87  0.0028922178 0.0025194204  1
88  0.3361085310 0.1747924478  1
89  0.3711056788 0.5090713731  1
91  0.1567314465 0.2257289813  1
92  0.2468945589 0.2873696594  1
93  0.4181940833 0.5386756515  1
94  0.0006693855 0.0009954765  1
97  0.5032828745 0.4528221169  1
98  0.1263827172 0.2061782382  1
100 0.2172021497 0.3284044677  1

$`Group 2`
          V1        V2 V3
1  0.8026952 0.8049413  2
4  0.9067146 0.9211618  2
5  0.6663730 0.6644439  2
6  0.9752217 0.8299001  2
10 0.9316250 0.7688236  2
13 0.5469462 0.7337635  2
15 0.9985435 0.9798016  2
17 0.9405264 0.9293155  2
20 0.9093677 0.8700411  2
22 0.8534189 0.8588961  2
26 0.7431083 0.7314991  2
28 0.5360674 0.5205754  2
30 0.8598931 0.6556079  2
31 0.9627599 0.9784145  2
33 0.8786638 0.9375875  2
34 0.9820861 0.9924680  2
35 0.9842370 0.8563016  2
38 0.7254415 0.5022279  2
39 0.5147399 0.6646237  2
40 0.5649032 0.5269890  2
41 0.6549369 0.3945912  2
42 0.5516358 0.6239680  2
45 0.6468016 0.7153619  2
46 0.6563411 0.5619520  2
47 0.7067819 0.8747652  2
50 0.8484266 0.9106415  2
52 0.7308605 0.7618075  2
56 0.6541389 0.6897037  2
58 0.9919688 0.9912455  2
61 0.7337411 0.8031263  2
63 0.6197967 0.7248485  2
64 0.8043085 0.6823161  2
67 0.8335984 0.8269856  2
70 0.8223687 0.5893472  2
71 0.8825491 0.8567182  2
74 0.6618860 0.6667555  2
77 0.9099707 0.8625390  2
78 0.7993220 0.7537365  2
79 0.6462522 0.6272202  2
86 0.6750035 0.6898848  2
90 0.6074356 0.6478351  2
95 0.9299434 0.9395287  2
96 0.9722002 0.9615102  2
99 0.6839361 0.7226945  2
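
The point about matrices in the first bullet can be checked on a small toy example. The sketch below is not part of the original answer; `m` and `f` are made-up stand-ins for `Sim` and `kme$cluster`. It shows that `split()` on a matrix returns plain vectors, while splitting the row indices and subsetting with `drop = FALSE` keeps each group as a matrix and leaves out the cluster column.

```r
# Toy 4 x 2 matrix and a made-up grouping vector
# (hypothetical stand-ins for Sim and kme$cluster)
m <- matrix(1:8, nrow = 4)
f <- c(1, 2, 2, 1)

# split() treats the matrix as a plain vector and recycles f
# over all 8 elements, so the row structure is lost
by_element <- split(m, f)

# Split the row indices instead, then subset the matrix by rows
by_row <- lapply(split(seq_len(nrow(m)), f),
                 function(i) m[i, , drop = FALSE])
names(by_row) <- paste0("Group_", names(by_row))
```

Applied to the real data, `m` would be `Sim` and `f` would be `kme$cluster`; each element of `by_row` is then a two-column matrix of observation values.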

Upvotes: 2

MonJeanJean

Reputation: 2906

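Add the kme$cluster values to the data (converted to a data frame), then build one object per group by filtering on the cluster value.

From what I understand without a data sample:

library(dplyr)

# Sim is a matrix, so convert it to a data frame before using dplyr verbs
Sim_df <- as.data.frame(Sim) %>%
  mutate(cluster_group = kme$cluster)

# The clusters usually differ in size, so keep the groups in a list
# rather than as columns of one data frame
groups <- list(
  Group1 = Sim_df %>% filter(cluster_group == 1) %>% select(-cluster_group),
  Group2 = Sim_df %>% filter(cluster_group == 2) %>% select(-cluster_group)
)

groups$Group1 and groups$Group2 then hold the values used for the k-means fit in Sim, one cluster each.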
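
For comparison, the same grouping can be sketched in base R with logical indexing, which leaves each group as a matrix. This is an alternative to the dplyr code, not the answerer's method. The simulated `Sim` and the seed below are assumptions, used only so the sketch runs on its own.

```r
set.seed(1)  # arbitrary seed so the run is repeatable

# Hypothetical stand-in for Sim: two well-separated 2-D clouds, 100 rows in total
Sim <- rbind(matrix(runif(100, 0.0, 0.4), ncol = 2),
             matrix(runif(100, 0.6, 1.0), ncol = 2))

kme <- kmeans(Sim, centers = 2)

# Keep the rows whose cluster label matches; drop = FALSE guards against
# a one-row group collapsing to a vector
Group_1 <- Sim[kme$cluster == 1, , drop = FALSE]
Group_2 <- Sim[kme$cluster == 2, , drop = FALSE]
```

The cluster numbers that `kmeans()` assigns depend on its random start, so which cloud ends up in `Group_1` can change between runs unless a seed is set.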

Upvotes: 1
