BloopFloopy

Reputation: 139

R: help analysing cluster content in hierarchical clustering

I have a data frame with more than 5000 observations. Using hierarchical clustering, I obtained 8 clusters, some of which contain a few hundred or more than a thousand observations.

# Cut tree into 8 groups
cutree_hclust <- cutree(hclust.unsupervised, k = 8)

# Number of members in each cluster
table(cutree_hclust)

Next is an illustration of the size of each cluster:

cutree_hclust
   1    2    3    4    5    6    7    8 
 867   61   14  310 1135  432  119    5 

To see which combinations of variable values occur in each cluster, I thought it might be a good idea to turn the 8 clusters into separate data frames, so I can analyse them individually. This is because I have no idea which rows fall into which cluster, and therefore don't know what the pattern in the overall data frame (Complete_df) is.

However, how can I make these new dataframes?

I can see what I assume to be the rows in a given cluster with, for example:

rownames(MY_df)[cutree_hclust == 7]

[1] "60"  "72"  "92"  "97"  "110" "210" "211" "267"
[9] "565"

But if I type:

h_clust <- as.data.frame(rownames(MY_df)[cutree_hclust == 7])

I only get a view (as a list) of what rows are in this cluster and all the other columns are not included.

How can I select these specific rows in my data frame called Complete_df, so that I can see what the overall variable combination is for each cluster?

I have tried the following:

rn <- rownames(MY_df)[cutree_hclust == 7]; subset(Complete_df, rn %in% rownames(MY_df))

(taken from: "R how to select several rows to make a new dataframe")

and

Clust_7 <- rownames(MY_df)[cutree_hclust == 7]

Clust_7_df <- data.frame(matrix(unlist(Clust_7), nrow=9, byrow=T))

Neither of the above attempts worked.

I look forward to hearing back from anyone who can help - as I have not been able to figure this out for myself :-)

Upvotes: 2

Views: 661

Answers (2)

Gwang-Jin Kim

Reputation: 9865

I'll use the mtcars data frame as an example.

df <- mtcars

Now cluster the df:

hclust.unsupervised <- hclust(dist(df))

Then cut the tree into k = 8 groups:

cutree_hclust <- cutree(hclust.unsupervised, k = 8)

str(cutree_hclust) shows that it is a named integer vector: each element is a cluster number, and its name is the name of the corresponding row in df.

So the simplest approach is to add this vector as an additional column to your original data frame:

df$cluster <- cutree_hclust

Now you can split the original data frame into a list of sub-data frames by the value of the df$cluster column:

df.list <- split(df, df$cluster)

This list contains one sub-data frame per cluster, which I think is what you wanted:

$`1`
               mpg cyl  disp  hp drat    wt  qsec vs am gear carb cluster
Mazda RX4     21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4       1
Mazda RX4 Wag 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4       1
Datsun 710    22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1       1
Merc 240D     24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2       1
Merc 230      22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2       1
Merc 280      19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4       1
Merc 280C     17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4       1
Toyota Corona 21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1       1
Porsche 914-2 26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2       1
Lotus Europa  30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2       1
Volvo 142E    21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2       1

$`2`
                mpg cyl disp  hp drat    wt  qsec vs am gear carb cluster
Hornet 4 Drive 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1       2
Valiant        18.1   6  225 105 2.76 3.460 20.22  1  0    3    1       2

$`3`
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb cluster
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2       3
Duster 360        14.3   8  360 245 3.21 3.570 15.84  0  0    3    4       3
Camaro Z28        13.3   8  350 245 3.73 3.840 15.41  0  0    3    4       3
Pontiac Firebird  19.2   8  400 175 3.08 3.845 17.05  0  0    3    2       3
Ford Pantera L    15.8   8  351 264 4.22 3.170 14.50  0  1    5    4       3

$`4`
                  mpg cyl  disp  hp drat    wt  qsec vs am gear carb cluster
Merc 450SE       16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3       4
Merc 450SL       17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3       4
Merc 450SLC      15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3       4
Dodge Challenger 15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2       4
AMC Javelin      15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2       4

$`5`
                     mpg cyl disp  hp drat    wt  qsec vs am gear carb cluster
Cadillac Fleetwood  10.4   8  472 205 2.93 5.250 17.98  0  0    3    4       5
Lincoln Continental 10.4   8  460 215 3.00 5.424 17.82  0  0    3    4       5
Chrysler Imperial   14.7   8  440 230 3.23 5.345 17.42  0  0    3    4       5

$`6`
                mpg cyl disp hp drat    wt  qsec vs am gear carb cluster
Fiat 128       32.4   4 78.7 66 4.08 2.200 19.47  1  1    4    1       6
Honda Civic    30.4   4 75.7 52 4.93 1.615 18.52  1  1    4    2       6
Toyota Corolla 33.9   4 71.1 65 4.22 1.835 19.90  1  1    4    1       6
Fiat X1-9      27.3   4 79.0 66 4.08 1.935 18.90  1  1    4    1       6

$`7`
              mpg cyl disp  hp drat   wt qsec vs am gear carb cluster
Ferrari Dino 19.7   6  145 175 3.62 2.77 15.5  0  1    5    6       7

$`8`
              mpg cyl disp  hp drat   wt qsec vs am gear carb cluster
Maserati Bora  15   8  301 335 3.54 3.57 14.6  0  1    5    8       8
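With the cluster number stored as a column, a per-cluster summary can be a quicker way to spot the pattern than reading every sub-data frame. This is only a minimal sketch: it assumes all variables are numeric and that the mean is a sensible summary for them, which may not hold for your data. The names cluster_means and cluster_sizes are made up for this illustration.

```r
# Rebuild the example from above so this block runs on its own.
df <- mtcars
df$cluster <- cutree(hclust(dist(mtcars)), k = 8)

# Mean of every variable within each cluster (one row per cluster).
# Assumption: the mean is a meaningful summary for these columns.
cluster_means <- aggregate(. ~ cluster, data = df, FUN = mean)

# Cluster sizes, to read next to the means.
cluster_sizes <- table(df$cluster)

print(cluster_means)
print(cluster_sizes)
```

For categorical variables you would probably swap the mean for something like a frequency table per cluster.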

The elements of df.list are named after the cluster numbers and stored in that order, so you can get the data frame for cluster 3 with, e.g.:

df.list[[3]]

which gives:

                   mpg cyl disp  hp drat    wt  qsec vs am gear carb cluster
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2       3
Duster 360        14.3   8  360 245 3.21 3.570 15.84  0  0    3    4       3
Camaro Z28        13.3   8  350 245 3.73 3.840 15.41  0  0    3    4       3
Pontiac Firebird  19.2   8  400 175 3.08 3.845 17.05  0  0    3    2       3
Ford Pantera L    15.8   8  351 264 4.22 3.170 14.50  0  1    5    4       3
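One caveat, sketched below: [[3]] picks the third element by position, which only matches cluster 3 while the list is complete and in order. If you filter the list, looking elements up by name (as a character string) is probably safer. The threshold of 4 rows and the names cluster_3 and big_only are arbitrary choices for this illustration.

```r
# Rebuild the example from above so this block runs on its own.
df <- mtcars
df$cluster <- cutree(hclust(dist(mtcars)), k = 8)
df.list <- split(df, df$cluster)

# Look up cluster 3 by name rather than by position.
cluster_3 <- df.list[["3"]]

# Keep only clusters with at least 4 members (arbitrary threshold).
# Positions shift after filtering, but the names still identify the clusters.
big_only <- df.list[sapply(df.list, nrow) >= 4]
names(big_only)
```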

Upvotes: 1

kangaroo_cliff

Reputation: 6222

I think your aim is to separate clusters into individual data.frames. If so, the following could help.

It uses the USArrests data. Only a part of the final output is shown.

hc <- hclust(dist(USArrests), "ave")
k <- 8 
cutree_hclust <- cutree(hc, k = k)
df_list <- lapply(1 : k, function(x) USArrests[which(cutree_hclust == x), ])

df_list
# [[1]]
#                Murder Assault UrbanPop Rape
# Alabama          13.2     236       58 21.2
# Alaska           10.0     263       48 44.5
# Delaware          5.9     238       72 15.8
# Illinois         10.4     249       83 24.0
# Louisiana        15.4     249       66 22.2
# Michigan         12.1     255       74 35.1
# Mississippi      16.1     259       44 17.1
# Nevada           12.2     252       81 46.0
# New York         11.1     254       86 26.1
# South Carolina   14.4     279       48 22.5
# 
# [[2]]
#            Murder Assault UrbanPop Rape
# Arizona       8.1     294       80 31.0
# California    9.0     276       91 40.6
# Maryland     11.3     300       67 27.8
# New Mexico   11.4     285       70 32.1
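Applied to the objects named in the question, the same idea might look like the sketch below. Complete_df and MY_df are rebuilt here as hypothetical stand-ins from USArrests, under the assumption that MY_df holds the clustering variables and shares its row names with Complete_df; if your row names differ between the two, this will not select the right rows.

```r
# Hypothetical stand-ins for the asker's objects (assumption: MY_df has the
# same row names as Complete_df and only the columns used for clustering).
Complete_df <- USArrests
Complete_df$State <- rownames(USArrests)
MY_df <- USArrests[, c("Murder", "Assault", "Rape")]

cutree_hclust <- cutree(hclust(dist(MY_df)), k = 8)

# Row names of the observations in cluster 7 ...
rn <- rownames(MY_df)[cutree_hclust == 7]

# ... used to pull the full rows, with all columns, from Complete_df.
Clust_7_df <- Complete_df[rn, , drop = FALSE]

# The subset() variant: test Complete_df's row names against rn.
Clust_7_alt <- subset(Complete_df, rownames(Complete_df) %in% rn)
```

Note that the subset() call in the question tests rn against rownames(MY_df), which is TRUE for every element of rn, so it does not filter Complete_df by cluster.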

Upvotes: 3
